# Preface {-}

Placeholder

## Contributing {-}

## Bibliographic Note {-}

## Rendering Mathematical Formulae {-}

## R Code Conventions {-}

## Acknowledgements {-}

## License {-}

# Introduction

Placeholder

## Time Series

## Exploratory Data Analysis for Time Series {#eda}

## Basic Time Series Models {#basicmodels}

### White noise processes {#wn}

### Random Walk Processes {#rw}

### Autoregressive Process of Order 1 {#ar1}

### Moving Average Process of Order 1 {#ma1}

### Linear Drift {#drift}

### Composite Stochastic Processes {#lts}

# Autocorrelation and Stationarity

Placeholder

## The Autocorrelation and Autocovariance Functions

### A Fundamental Representation

### Admissible Autocorrelation Functions

## Stationarity {#stationary}

### Assessing Weak Stationarity of Time Series Models

## Estimation of Moments of Stationary Processes

### Estimation of the Mean Function

### Sample Autocovariance and Autocorrelation Functions

### Robustness Issues

## Joint Stationarity

### Sample Cross-Covariance and Cross-Correlation Functions

## Portmanteau test

### Proof

## Exercises




# Autoregressive Moving Average Models

```r
library(simts)
library(ggplot2)
library(robcor)
library(gridExtra)   # needed for grid.arrange() used later in this chapter
library(gmwm)        # assumed here: provides gmwm() used later in this chapter
```

"Essentially, all models are wrong, but some are useful", George Box

In this chapter we introduce a class of time series models that is flexible and among the most commonly used to describe stationary time series. This class is represented by the AutoRegressive Moving Average (ARMA) models which combine and include the AutoRegressive (AR) and Moving Average (MA) models seen in the previous chapter. We first discuss AR and MA models in further detail before introducing the general ARMA class. The importance of the latter class of models resides in its flexibility as well as its capacity to describe (or closely approximate) the features of stationary time series. The autoregressive parts of these models describe how consecutive observations in time influence each other, while the moving average parts capture possible unobserved shocks, thereby allowing these models to describe phenomena ranging from biology to finance.

To introduce and explain this class of models, this chapter is organized in the following manner. First of all, we will discuss the class of linear processes to which the ARMA models belong. We will then proceed to a detailed description of AR models in which we review their definition, explain their properties, introduce the main estimation methods for their parameters and highlight the diagnostic tools which can help understand whether the estimated models appear to be appropriate or sufficient to adequately describe the observed time series. Once this is done, we will use most of the results given for AR models to further describe and discuss MA models, for which we underline the property of invertibility, and finally the ARMA models. Indeed, the properties and estimation methods for the latter class are directly inherited from the discussions on AR and MA models.

## Linear Operators and Processes

Within this section, the concepts of linear operators and processes will be discussed thereby providing some theoretical bases to model a variety of time series data. In fact, every time series that is weakly stationary is either a linear process or can be redefined as such after being transformed using a linear operation such as, for example, removing a trend through differencing.

### Linear Operators

Prior to introducing linear processes, we will first discuss the concept of linear operators that is necessary to better understand the former. These operators allow us to manipulate time series in order to transform them into a form that can be represented by the models presented in the previous chapters (and the ARMA models presented further on). We start with the formal definition of a linear operator.

```{definition, label="linop", name = "Linear Operator"} A linear operator $L$ is an operator that satisfies the following two conditions:

[\begin{aligned} L\left( {{X_t} + {Y_t}} \right) &= L{X_t} + L{Y_t} & \text{(Preserves Addition)}\ L\left( {\lambda \cdot {X_t}} \right) &= \lambda \cdot \left( {L{X_t}} \right) & \text{(Preserves Scalar Multiplication)} \ \end{aligned} ]

Given these definitions, the following subsections will investigate some specific linear operators that are particularly suited to manipulate the time series that are treated within this chapter.

#### Backshift (Lag) Operator

The backshift operator is a linear operator which can be used to define the other operators that will be discussed in the following paragraphs. The purpose of this operator is to change the
indices of the time series by one period (e.g. $t \rightarrow t-1$) thereby shifting a variable at time $t$ to its predecessor at time $t-1$. 

```{definition, label="backshiftop", name = "Backshift Operator"}
The backshift operator $B$ is defined as:

\[B{X_t} = {X_{t - 1}}\]

Given this definition, the backshift operator has the following properties:

  1. ${B^k}{X_t} = B^{k-1} B {X_t} = B^{k-1} {X_{t-1}} = ... = {X_{t - k}}$ for $k \in \mathbb{N}^{+}$
  2. $B^{-k}{X_t} = {X_{t + k}}$ for $k \in \mathbb{N}^{+}$ (i.e. negative powers shift the series forward in time)
  3. $Bc = c$ for $c \in \mathbb{R}$
  4. $B^0 X_t = X_t$ (the identity operator)

Often the backshift operator may be denoted as $L$ which can be confused with the notation of the linear operator provided earlier. Therefore, in the context of this chapter (and book), please notice that $L$ will denote a general linear operator and that $B$ will instead denote the backshift operator.

```{example, label="backshiftseasons", name = "Backshift operator for seasonality"} A certain power used for the backshift operator can be useful to deal with seasonal features in the observed time series. For example, in the case of monthly data, $B^{12} X_t = X_{t-12}$ would eventually allow to deal with cycles that occur every year while $B^{3} X_t = X_{t-3}$ would be more adapted to dealing with quarterly characteristics.

#### Difference Operator

A very useful operator that can be defined through the backshift operator $B$ is given by the difference operator $\nabla$. The latter is particularly helpful when there is a need to remove trends from the time series.

```{definition, label="diffop", name = "Difference Operator"}
The **Difference Operator** is defined as follows:

\[\nabla {X_t} = \left( {1 - B} \right){X_t} = {X_t} - {X_{t - 1}}.\]

The difference operator has the following properties:

1. ${\nabla ^k}{X_t} = {\nabla ^{k - 1}} \left( {\nabla {X_t}}\right) = \left( {1 - B} \right)^k {X_t}$
2. $\nabla c = 0$ for $c \in \mathbb{R}$
3. ${\nabla ^0}{X_t} = {X_t}$

```{example, label="diffex", name = "Repetitive Differencing"} To understand the repetitive nature of the difference operator, consider the case when the number of differences is $k = 2$:

\[\begin{aligned}
{\nabla ^2}{X_t} &= \nabla \left( {\nabla {X_t}} \right) \\
&= \nabla \left( {{X_t} - {X_{t - 1}}} \right) = \nabla {X_t} - \nabla {X_{t - 1}} \\
&= \left( {{X_t} - {X_{t - 1}}} \right) - \left( {{X_{t - 1}} - {X_{t - 2}}} \right) \\
&= {X_t} - 2{X_{t - 1}} + {X_{t - 2}}
\end{aligned}\]

Notice that at the end of the process, we are able to re-express the differencing operation in terms of the backshift operator:

\[\begin{aligned}
{\nabla ^2}{X_t} &= {X_t} - 2{X_{t - 1}} + {X_{t - 2}} = {X_t} - 2B{X_t} + B^2{X_t} \\
&= \left( {1 - B} \right)\left( {1 - B} \right){X_t} = {\left( {1 - B} \right)^2}{X_t}
\end{aligned}\]
```
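As a quick numerical illustration (added here and not part of the original text, using simulated data), the base R function `diff()` applies exactly this operator and reproduces the expression derived above:

```r
# Check that double differencing reproduces X_t - 2*X_{t-1} + X_{t-2}
set.seed(1)
x = cumsum(rnorm(10))                    # a short random walk
d2 = diff(x, differences = 2)            # nabla^2 X_t
manual = x[3:10] - 2*x[2:9] + x[1:8]     # direct computation of the same quantity
all.equal(as.numeric(d2), manual)        # should return TRUE
```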

#### Seasonal Difference Operator

A specific case of the backshift operator, as highlighted earlier, is the capacity of adapting to seasonal features in the time series. Indeed, there may be trends in the time series that occur over lags of months or even years. In order to remove these trends in the process, a seasonal version of the difference operator can therefore be used.

```{definition, label="seasonaldiff", name = "Seasonal Difference Operator"}
The seasonal difference operator can be applied by using a seasonal power of the backshift operator:

\[{\nabla _s}{X_t} = \left( {1 - B^s} \right){X_t} = {X_t} - {X_{t - s}}, \]

where $s$ denotes the lag at which the seasonal trend occurs. Being a specific case of the difference operator, the previously discussed
properties associated with the difference operator generally hold. However, the seasonal difference operator is more specifically represented as follows:

\[{\nabla ^k_s}{X_t} = {\left( {1 - B^s} \right)^k}{X_t},\]

and therefore can be more complex to apply than a simple difference operator.
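For illustration (an example added here, not taken from the original text), the built-in monthly `co2` series can be seasonally differenced at lag $s = 12$ with the base R function `diff()`:

```r
# Seasonal differencing at lag 12 removes the yearly cycle from the monthly co2 series
d12 = diff(co2, lag = 12)    # (1 - B^12) X_t = X_t - X_{t-12}
par(mfrow = c(2, 1))
plot(co2, main = "Monthly CO2 concentration")
plot(d12, main = "Seasonally differenced series")
```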

### Linear Processes

The stationary models considered so far, including the ARMA models treated further on in this chapter, can all be represented as linear processes. Indeed, a linear process is a process that can be represented as an infinite linear combination of independent random variables $\left\{W_t\right\}$ with linear coefficients generally denoted as $\psi_j$, $j = -\infty,...,+\infty$. The following figure gives a general idea of how a linear process can be obtained.

```r
knitr::include_graphics("images/filter/filtering.png")
```

To put this definition into context, recall Example \@ref(ex:exar1) where backsubstitution was used to represent an AR(1) model as a linear combination of a white noise process $\left\{W_t\right\}$. Using the backshift operator $B$ defined earlier, we can revisit this example in the following manner:

\[\begin{aligned}
{X_t} &= \phi {X_{t - 1}} + {W_t} \\
{X_t} - \phi {X_{t - 1}} &= {W_t} \\
\left( {1 - \phi B} \right){X_t} &= {W_t} \\
{X_t} &= \frac{1}{{1 - \phi B}}{W_t} \\
&= \sum\limits_{j = 0}^\infty {{\phi ^j}{B^j}{W_t}} = \sum\limits_{j = 0}^\infty {{\phi ^j}{W_{t - j}}}.
\end{aligned}\]

As can be seen, under the condition that $\left|\phi\right| < 1$, an AR(1) can therefore be represented as a linear process. Given the above discussion and examples, let us now formalize the definition of this class of processes.
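As a quick numerical check (an addition to the text, not part of the original), the function `ARMAtoMA()` from the stats package returns these linear coefficients and confirms that $\psi_j = \phi^j$ for a causal AR(1):

```r
# psi-weights of a causal AR(1) with phi = 0.5: psi_j = phi^j
phi = 0.5
ARMAtoMA(ar = phi, lag.max = 5)   # psi_1, ..., psi_5
phi^(1:5)                         # same values
```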

```{definition, label="lp", name="Linear Process"} A time series $\left{X_t\right}$ is defined to be a linear process if it can be expressed as a linear combination of white noise by:

[{X_t} = \mu + \sum\limits_{j = - \infty }^\infty {{\psi j}{W{t - j}}} ]

where $W_t \sim WN(0, \sigma^2)$ and $\sum\limits_{j = - \infty }^\infty {\left| {{\psi _j}} \right|} < \infty$.

Note that the latter assumption (i.e. $\sum\limits_{j =  - \infty }^\infty  {\left| {{\psi _j}} \right|}  < \infty$) is required to ensure that the series has a limit. Hence, the set of coefficients \[{\{ {\psi _j}\} _{j =  - \infty , \cdots ,\infty }}\]
can be viewed as a linear filter whose coefficients do not have to be all equal nor symmetric, as later examples will show.

```{theorem, label="lpstat", name="Stationarity of Linear Process"}
A linear process $\left\{X_t\right\}$ in the form of Definition \@ref(def:lp) 
is stationary with:

\[\begin{aligned}
\mathbb{E}[X_t] &= \mu \\
\gamma(h) &= \sigma^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{h + j}}}.
\end{aligned}\]
Given that the
series converges under $\sum\limits_{j =  - \infty }^\infty  {\left| {{\psi _j}} \right|}  < \infty$, the expectation and autocovariance can be computed in a straightforward manner. Indeed, for the expectation we have:

\[E\left[ {{X_t}} \right] = E\left[ {\mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{W_{t - j}}} } \right] = \mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}E\left[ {{W_{t - j}}} \right]}  = \mu  + 0 = \mu \]

As for the autocovariance, noting that $\cov\left( {{W_{t + h - j}},{W_{t - i}}} \right) = 0$ for $i \ne j - h$, we have:

\[\begin{aligned}
  \cov \left( {{X_{t + h}},{X_t}} \right) &= \cov \left( {\mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{W_{t + h - j}}} ,\mu  + \sum\limits_{i =  - \infty }^\infty  {{\psi _i}{W_{t - i}}} } \right) \\
   &= \sum\limits_{j =  - \infty }^\infty  {\sum\limits_{i =  - \infty }^\infty  {{\psi _j}{\psi _i}\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)} }  = \sum\limits_{j =  - \infty }^\infty  {{\psi _{j + h}}{\psi _j}\cov \left( {{W_t},{W_t}} \right)}  + \sum\limits_{j =  - \infty }^\infty  {\sum\limits_{i =  - \infty }^\infty  {{\psi _j}{\psi _i}\underbrace {\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)}_{ = 0}} }  \\
   &= \sigma _W^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{j-h}}}.  \\ 
\end{aligned} \]

```{example, label="backshiftlp", name = "Backshift operators for Linear Processes"} A special notation for linear processes based on the backshift operator is as follows:

[\begin{aligned} {X_t} &= \mu + \psi \left( B \right){W_t} \ \psi \left( B \right) &= \sum\limits_{j = 0}^\infty {{\psi _j}{B^j}} \ \end{aligned} ]

This notation allows to define the processes in a more compact form.

### Examples of Linear Processes

Let us now consider how to represent the time series models discussed in
Section \@ref(basicmodels) as a linear process. Generally speaking, the work required to transform the traditional representation of a time series model into the form of a linear process consists mainly in finding the set of corresponding linear coefficients $\psi_j$ highlighted earlier.

```{example, label="lpwn", name="White Noise"}

The representation of a white noise process $\left\{X_t\right\}$, defined in \@ref(wn), as a linear process is quite straightforward since we have that:

\[\psi _j = \begin{cases}
      1 , &\mbox{ if } j = 0\\
      0 , &\mbox{ if } |j| \ge 1,
\end{cases}\]

and $\mu = 0$. Indeed, $X_t = W_t$ with $W_t \sim WN(0, \sigma^2_W)$, and therefore $\psi_0 = 1$ while the other coefficients are zero.

```{example, label="lpma1", name="First Order Moving Average"}

Consider the MA(1) process $\left\{X_t\right\}$ defined in \@ref(ma1). The representation of this process as a linear process can be done by defining the coefficients $\psi_j$ as follows:

\[\psi _j = \begin{cases}
      1, &\mbox{ if } j = 0\\
      \theta , &\mbox{ if } j = 1 \\
      0, &\mbox{ otherwise},
\end{cases}\]

and $\mu = 0$. Also in this case, as for the white noise process, the linear representation is directly observable from the traditional representation of the MA(1) process given by $X_t = W_t + \theta W_{t-1}$. It is in fact easy to see that $\psi_0 = 1$, $\psi_1 = \theta$ and the other coefficients are zero.
```

```{example, label="lpsma", name="Symmetric Moving Average"}

Similarly to the MA(1) process, let us define a symmetric moving average which is given by:

\[{X_t} = \frac{1}{{2q + 1}}\sum\limits_{j =  - q}^q {{W_{t + j}}} \]

The process $\left\{X_t\right\}$ is therefore defined for $q + 1 \le t \le n-q$ and can also be represented as a linear process since:

\[\psi _j = \begin{cases}
      \frac{1}{{2q + 1}} , &\mbox{ if } -q \le j \le q\\
      0 , &\mbox{ if } |j| > q.
\end{cases}\]

and $\mu = 0$.

```{example, label="lpar1", name="First Order Autoregressive Process"} A slightly less intuitive approach would be to represent an AR(1) process as a linear process. To do so, let $\left{X_t\right}$ in this case represent an AR(1) process, defined in \@ref(ar1) as being $X_t = \phi X_{t-1} + W_t$, with $\left| \phi \right| < 1$. If we use a first-step backsubsitution, from this we obtain that $X_t = \phi^2 X_{t-2} + \phi W_{t-1} + W_t$ and, by doing so recursively and using convergence results based on the value of $\phi$, this process can indeed be represented as a linear process by defining the coefficients $\psi_j$ as:

[\psi _j = \begin{cases} \phi^j , &\mbox{ if } j \ge 0\ 0 , &\mbox{ if } j < 0. \end{cases}]

## Autoregressive Models

The class of autoregressive models is based on the idea that previous values in the time series are needed to explain current values in the series. For this class of models, we assume that the $p$ previous observations are needed for this purpose and we therefore denote this class as AR($p$). In the previous  chapter, the model we introduced was an AR(1) in which only the immediately previous observation is needed to explain the following one and therefore represents a particular model which is part of the more general class of AR(p) models.

```{definition, label="arp", name = "Autoregressive Models of Order $p$"}
The AR(p) models can be formally represented as follows
$${X_t} = {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$
where $\phi_j \neq 0$, for $j=1,...,p$, and $W_t$ is a (Gaussian) white noise process with
variance $\sigma^2$. 

In general, we will assume that the expectation of the process $({X_t})$, as well as that of the following ones in this chapter, is zero. The reason for this simplification is that if $\mathbb{E} [ X_t ] = \mu$, we can define an AR(p) process around $\mu$ as follows:

$$X_t - \mu = \sum_{i = 1}^p \phi_i \left(X_{t-i} - \mu \right) + W_t,$$

which is equivalent to

$$X_t = \mu^{\star} + \sum_{i = 1}^p \phi_i X_{t-i} + W_t,$$

where $\mu^{\star} = \mu (1 - \sum_{i = 1}^p \phi_i)$. Therefore, to simplify the notation we will generally consider only zero mean processes, since adding means (as well as other deterministic trends) is easy.

A useful way of representing AR(p) processes is through the backshift operator introduced in the previous section and is as follows

\[\begin{aligned}
{X_t} &= {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t} \\
&= {\phi_1}B{X_t} + ... + {\phi_p}B^p{X_t} + {W_t} \\
&= ({\phi_1}B + ... + {\phi_p}B^p){X_t} + {W_t},
\end{aligned}\]

which finally yields

$$(1 - {\phi _1}B - ... - {\phi_p}B^p){X_t} = {W_t},$$

or, in abbreviated form, can be expressed as

$$\phi(B){X_t} = W_t.$$

We will see that $\phi(B)$ is important to establish the stationarity of these processes and is called the autoregressive operator. Moreover, this quantity is closely related to another important property of AR(p) processes called causality. Before formally defining this new property, we consider the following example, which provides an intuitive illustration of its importance.

```{example, label="ncAR1", name="Non-causal AR(1)"} Consider a classical AR(1) model with $|\phi| > 1$. Such a model could be expressed as

$$X_t = \phi^{-1} X_{t+1} - \phi^{-1} W_{t+1} = \phi^{-k} X_{t+k} - \sum_{i = 1}^{k-1} \phi^{-i} W_{t+i}.$$

Since $|\phi| > 1$, we obtain

$$X_t = - \sum_{i = 1}^{\infty} \phi^{-j} W_{t-j},$$

which is a linear process and therefore is stationary. Unfortunately, such a model is useless because we need the future to predict the future itself and these processes are therefore called non-causal.

### Properties of AR(p) models

In this section we discuss the main properties of AR(p) processes. We first consider the causality of these models which has already been introduced in the previous paragraphs and then discuss the autocorrelation (and partial autocorrelation) of AR(p) models. 

#### Causality

In the previous section we already introduced the idea of a causal time series model and therefore let us now introduce this concept in a more formal manner.

```{definition, label="causalar", name = "Causality of AR(p) models"}
 An AR(p) model is said to be *causal* if the time series $(X_t)$ can be
 written as a one-sided linear process:

$$X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j} = \psi(B) W_t,$$

where $\psi(B) = \sum_{j = 0}^{\infty} \psi_j B^j$, $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and $\psi_0 = 1$.
```

As discussed earlier this condition implies that only the past values of the time series can explain the future values of it, and not vice versa. Moreover, in the previous section we saw that linear processes are (weakly) stationary processes and therefore causality directly implies (weak) stationarity. However, it might be difficult and not obvious to show the causality of AR(p) processes by using the above definitions directly, thus the following property is particularly useful in practice.

```{theorem, label="theocausalar", name = "Causality of AR models"} An AR(p) model is causal if and only if $\phi(z) \neq 0$ for $|z| \leq 1$. Then, the coefficients of the one-sided linear process given in \@ref(def:causalar) can be obtained by solving \begin{equation} \psi(z) = \frac{1}{\sum_{j=0}^{\infty} \phi_j z^j} = \frac{1}{\phi(z)}, \mbox{ } |z| \leq 1. \end{equation}

The proof of this result is omitted but can (for example) be found in Appendix B of @shumway2010time. Moreover, it can be seen how there is no solution to the above equation if $\phi(z) = 0$ for $|z| \leq 1$ and therefore, as highlighted earlier, an AR(p) is causal if and only if $\phi(z) \neq 0$ for $|z| \leq 1$. A condition for this to be respected is for the roots of $\phi(z) = 0$ to lie *outside the unit circle*.

Before further discussing the properties of AR(p) models we consider the following two examples which discuss the causality (and stationarity) of AR(2) models.

```{example, label="ex1causality", name = "Transforming an AR(2) into a linear process"}

Given the following AR(2) $X_t = 1.3X_{t-1} - 0.4X_{t-2} + W_t$ one could
wonder if this process can be written as a one-sided linear process as in
Definition \@ref(def:causalar). This can be done using the following approach:

*Step 1:* The autoregressive operator of this model can be expressed as:

\[\phi(z) = 1 - 1.3z + 0.4z^2 = (1 - 0.5z)(1 - 0.8z) ,\]

and has roots $z_1 = 2 > 1$ and $z_2 = 1.25 > 1$. Thus, we should be able to convert it
into a linear process.


*Step 2:* From Theorem \@ref(thm:theocausalar) we know that if an AR(p) process has all its roots outside the unit circle, we can write $X_t = \phi^{-1}(B)W_t$ and "inverse" the autoregressive operator $\phi(B)$ as follows:

\[\phi^{-1}(z) = \frac{1}{(1 - 0.5z)(1 - 0.8z)} = \frac{c_1}{(1 - 0.5z)} + \frac{c_2}{(1 - 0.8z)} = \frac{c_2(1 - 0.5z) + c_1(1 - 0.8z)}{(1 - 0.5z)(1 - 0.8z)} .\]

Since we can think of the above equation as valid for any $z$, we will solve  the following system of equations to get $c_1$ and $c_2$,

\[\begin{cases}
        c_1 + c_2 &= 1\\
        -0.5 c_2 - 0.8 c_1 &= 0
\end{cases}
\implies 
\begin{cases}
        c_1 &= -\frac{5}{3}\\
        c_2 &= \frac{8}{3},
\end{cases}
\]

and, we obtain

\[\phi^{-1}(z) = \frac{-5}{3(1 - 0.5z)} + \frac{8}{3(1 - 0.8z)} .\]

*Step 3:* Using the Geometric series, i.e. 
$a \sum_{k = 0}^{\infty}r^k = \frac{a}{1-r}, \mbox{ if } |r| < 1$, we have

\[\begin{cases}
      \frac{-5}{3(1 - 0.5z)} &= \frac{-5}{3} \sum_{j = 0}^{\infty}0.5^jz^j , \mbox{ if } |z| < 2\\
      \frac{8}{3(1 - 0.8z)} &= \frac{8}{3} \sum_{j = 0}^{\infty}0.8^jz^j , \mbox{ if } |z| < 1.25.
\end{cases}\]

This allows to express $\phi^{-1}(z)$ as

\[\phi^{-1}(z) = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] z^j, \mbox{ if } |z| < 1.25 .\]

*Step 4:* Finally, we obtain
\[X_t = \phi^{-1}(B)W_t = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] B^j W_t = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] W_{t-j},
\]
which verifies that the AR(2) is causal (and stationary).
```
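As a quick numerical verification (added here, not part of the original text), we can check the roots with `polyroot()` and compare the first few $\psi_j$ obtained in Step 4 with those returned by `ARMAtoMA()`:

```r
# Roots of phi(z) = 1 - 1.3z + 0.4z^2: both lie outside the unit circle
Mod(polyroot(c(1, -1.3, 0.4)))                          # 1.25 and 2

# First psi-weights: closed form from Step 4 vs. stats::ARMAtoMA
j = 0:4
psi_closed = -5/3 * 0.5^j + 8/3 * 0.8^j                 # psi_0, ..., psi_4
psi_R = c(1, ARMAtoMA(ar = c(1.3, -0.4), lag.max = 4))  # psi_0 = 1 by convention
rbind(psi_closed, psi_R)
```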

```{example, label="ex2causality", name = "Causal conditions for an AR(2) processes"} We already know that an AR(1) is causal with the simple condition $|\phi_1| < 1$. It could seem natural to believe that an AR(2) should be causal (and therefore stationary) with the conditon: $|\phi_i| < 1, \, i = 1,2$. However, this is not the case since an AR(2) can be expressed as

\[X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + W_t = \phi_1 B X_t + \phi_2 B^2 X_t + W_t,\]

thereby delivering the following autoregressive operator:

\[\phi(z) = 1 - \phi_1 z - \phi_2 z^2.\]

Therefore, the process is causal when the roots of $\phi(z)$ lie outside the unit circle. Letting $z_1$ and $z_2$ denote those roots, we impose the following constraints to ensure the causality of the model:

\[\begin{aligned}
|z_1| &> 1, \;\;\;\; \text{where} \;\; z_1 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{-2 \phi_2},\\
|z_2| &> 1, \;\;\;\; \text{where} \;\; z_2 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{-2 \phi_2},
\end{aligned}\]
noting that $z_1$ and $z_2$ can be complex valued.

Thus we can express $\phi_1$ and $\phi_2$ in terms of $z_1$ and $z_2$:
\[\begin{aligned}
\phi_1 &= (z_1^{-1} + z_2^{-1}),\\
\phi_2 &= -(z_1 z_2)^{-1}.
\end{aligned}\]

Moreover we have the following equivalent condition for causality:

$$\begin{cases} |z_1| > 1\\ |z_2| > 1, \end{cases}$$
if and only if
$$\begin{cases} \phi_1 + \phi_2 < 1\\ \phi_2 - \phi_1 < 1\\ |\phi_2| < 1. \end{cases}$$
```

```{proof, name="Proof of Causal Region for AR(2) Process"}

We know that $\phi_1=\frac{1}{z_1}+\frac{1}{z_2}$ and $\phi_2=-\frac{1}{z_1z_2}$, where $z_1$ and $z_2$ are the roots of $1-\phi_1z-\phi_2z^2=0$.

We want to show that $\left|{z_1}\right|>1$ and $\left|{z_2}\right|>1$ if and only if the following three conditions hold: $$\phi_1+\phi_2<1,\quad \phi_2-\phi_1<1,\text{ and }\left|{\phi_2}\right|<1.$$

First, we try to show the "if" direction:

When the above condition is true, if $z_1$ and $z_2$ are both complex, because they are conjugate, we can have $\left|{z_1}\right|=\left|{z_2}\right|$. Based on the formula that $z_1=\frac{\phi_1+\sqrt{\phi_1^2+4\phi_2}}{2\phi_2}$ with $\phi_1^2+4\phi_2<0$ and thus $\phi_2<0$, we can have

$$\left|{z_1}\right|^2=\frac{\phi_1^2-\phi_1^2-4\phi_2}{4\phi_2^2}=\frac{-4\phi_2}{4\phi_2^2}=\frac{1}{\left|{\phi_2}\right|}>1$$ The last inequality comes from the fact that $\left|\phi_2\right|<1$. Because $\left|{z_1}\right|=\left|{z_2}\right|$, we can then have $\left|{z_1}\right|>1$ and $\left|{z_2}\right|>1$.

If $z_1$ and $z_2$ are both real, because $\left|{\phi_2}\right|=\frac{1}{\left|{z_1}\right|\left|{z_2}\right|}<1$, at least one of $\left|{z_1}\right|$ and $\left|{z_2}\right|$ is greater than $1$, say $\left|{z_1}\right|>1$. Also, we can have $$\phi_1+\phi_2-1=\frac{1}{z_1}+\frac{1}{z_2}-\frac{1}{z_1z_2}-1=\left({\frac{1}{z_1}-1}\right)\left({1-\frac{1}{z_2}}\right)<0$$

$$1-\left({\phi_2-\phi_1}\right)=\frac{1}{z_1}+\frac{1}{z_2}+\frac{1}{z_1z_2}+1=\left({\frac{1}{z_1}+1}\right)\left({\frac{1}{z_2}+1}\right)>0$$

Because $\left|{z_1}\right|>1$, we have $\frac{1}{z_1}+1>0$ and $\frac{1}{z_1}-1<0$. Thus, we must have $1+\frac{1}{z_2}>0$ and $1-\frac{1}{z_2}>0$, which together imply $\left|\frac{1}{z_2}\right|<1$ and therefore $\left|{z_2}\right|>1$. Thus, the "if" direction is proved.

Then, we try to show the "only if" direction:

When we already have $\left|{z_1}\right|>1$ and $\left|{z_2}\right|>1$, then $\left|{\phi_2}\right|=\frac{1}{\left|{z_1}\right|\left|{z_2}\right|}<1$ is obviously true.

To show the other two, if $z_1$ and $z_2$ are both complex, we can let $z_1=a+ib$ and $z_2=a-ib$ with $a^2+b^2>1$, $a\neq 0$ and $b\neq 0$. Then $$\phi_1+\phi_2=\frac{1}{z_1}+\frac{1}{z_2}-\frac{1}{z_1z_2}=\frac{2a-1}{a^2+b^2}$$

Because $a^2+b^2-\left({2a-1}\right)=\left({a-1}\right)^2+b^2>0$ and $a^2+b^2>1>0$, we can have $\frac{2a-1}{a^2+b^2}<1$ and thus $\phi_1+\phi_2<1$. Similarly, we have $$\phi_2-\phi_1=-\left({\frac{1}{z_1}+\frac{1}{z_2}+\frac{1}{z_1z_2}}\right)=\frac{-2a-1}{a^2+b^2}$$

Because $a^2+b^2-\left({-2a-1}\right)=\left({a+1}\right)^2+b^2>0$ and $a^2+b^2>1>0$, we can have $\frac{-2a-1}{a^2+b^2}<1$ and thus $\phi_2-\phi_1<1$.

If $z_1$ and $z_2$ are both real, we can have $$\phi_1+\phi_2=\frac{1}{z_1}+\frac{1}{z_2}-\frac{1}{z_1z_2}=\frac{1}{z_1}\left({1-\frac{1}{z_2}}\right)+\frac{1}{z_2}<1-\frac{1}{z_2}+\frac{1}{z_2}=1$$

The above inequality comes from the fact that $1-\frac{1}{z_2}>0$ and $\frac{1}{z_1}\in\left({-1, 0}\right)\cup\left({0, 1}\right)$.

Similarly, we have $$\phi_2-\phi_1=-\left({\frac{1}{z_1}+\frac{1}{z_2}+\frac{1}{z_1z_2}}\right)=-\frac{1}{z_1}\left({\frac{1}{z_2}+1}\right)-\frac{1}{z_2}<\frac{1}{z_2}+1-\frac{1}{z_2}=1$$

The above inequality holds for similar reason. Thus, we can have $\phi_1+\phi_2<1$ and $\phi_2-\phi_1 <1$.

Thus, we have proved that $\left|{z_1}\right|>1$ and $\left|{z_2}\right|>1$ if and only if the following three conditions hold: $$\phi_1+\phi_2<1,\quad \phi_2-\phi_1<1,\text{ and }\left|{\phi_2}\right|<1.$$
```

Finally, the causal region of an AR(2) is depicted in Figure \@ref(fig:causalAR2).
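The original figure code is not available in this draft; the following is a minimal sketch (labelled `causalAR2` only so that the cross-reference above resolves) that shades the triangular region defined by the three conditions just proved:

```{r causalAR2, fig.cap = "Causal region of an AR(2) process"}
# Triangular causal region: phi_1 + phi_2 < 1, phi_2 - phi_1 < 1, |phi_2| < 1
plot(NA, xlim = c(-2.2, 2.2), ylim = c(-1.2, 1.2),
     xlab = expression(phi[1]), ylab = expression(phi[2]))
polygon(x = c(-2, 2, 0), y = c(-1, -1, 1), col = "lightgrey", border = NA)
abline(h = 0, v = 0, lty = 3)
```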


#### Autocorrelation

In this section we discuss the autocorrelation of (causal) AR(p) processes. Before considering the general case of an AR(p) model, we revisit Example \@ref(ex:ex1causality) and derive the ACF of the AR(2) model presented in this example.

```{example, label="ar2acf", name = "Autocorrelation of an AR(2)"} Considering the same model as in Example \@ref(ex:ex1causality), i.e. $X_t = 1.3X_{t-1} - 0.4X_{t-2} + W_t$, we can derive the ACF using the following steps:

Step 1: Find the homogeneous difference equation with respect to the ACF $\rho(h)$, which in this case is given by \[\rho(h) - 1.3\rho(h-1) + 0.4\rho(h-2) = 0, \mbox{ } h = 1,2,...,\] with initial conditions $\rho(0) = 1$ and $\rho(-1) = \frac{13}{14}$. Note that the above equation is a homogeneous difference equation of order 2.

Step 2: Using the results of Example \@ref(ex:ex1causality), we have \[\phi(z) = 1 - 1.3z + 0.4z^2 = (1 - 0.5z)(1 - 0.8z),\] and the roots of this equation are given by $z_1 = 2 > 1$ and $z_2 = 1.25 > 1$. Moreover, since $z_1$ and $z_2$ are real and distinct, and since they correspond to the solution of a homogeneous difference equation of order 2, we obtain:

\[\rho(h) = c_1 z_1^{-h} + c_2 z_2^{-h}.\]

Step 3: Solve for $c_1$ and $c_2$ based on the two initial conditions found in Step 1, i.e. \[\begin{cases} \rho(0) = c_1 + c_2 = 1\\ \rho(-1) = 2c_1 + 1.25c_2 = \frac{13}{14}, \end{cases}\] implying that $c_1 = -\frac{3}{7}$ and $c_2 = \frac{10}{7}$. The ACF for this model is therefore given by \[\rho(h) = -\frac{3}{7}2^{-h} + \frac{10}{7}\left(\frac{5}{4}\right)^{-h},\] which is represented in the plot below.
```
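The original plotting code is not available in this draft; a minimal sketch of the plot referred to above, based on the closed-form ACF just derived, is:

```r
# Theoretical ACF of the AR(2): rho(h) = -3/7 * 2^(-h) + 10/7 * (5/4)^(-h)
h = 0:20
rho = -3/7 * 2^(-h) + 10/7 * (5/4)^(-h)
plot(h, rho, type = "h", xlab = "Lag h", ylab = expression(rho(h)))
abline(h = 0)
```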


The method used in the previous paragraphs is only applicable for AR(2) models whose roots are distinct and real, which is the case when $\phi_2 > - \phi_1^2/4$. In the case where $\phi_2 = - \phi_1^2/4$, the autoregressive operator has a single (repeated) root and $\rho(h)$ can be obtained by determining (using initial conditions) the constants $c_1$ and $c_2$ of the following expression:

$$\rho(h) = z_1^{-h} \left(c_1 + c_2 h\right).$$ When $\phi_2 < - \phi_1^2/4$ the roots are complex conjugate pairs and the solution is given by:

$$\rho(h) = c_1 |z_1|^{-h} \cos \left(h \theta + c_2\right), $$ where the constants depend on the initial conditions, while $\theta = \text{arg}(z_1)$.

Therefore, we have that $\rho(h) \to 0$ exponentially fast as $h \to \infty$ and, when $\phi_2 < - \phi_1^2/4$, $\rho(h)$ goes to zero in a sinusoidal fashion. This behavior is illustrated in the next example.

```{example, label="ar2acfExample", name = "Autocorrelation of AR(2) Processes"} Consider the following models:

[\begin{aligned} \text{Model 1}: \;\;\; X_t &= X_{t-1} - 0.25 X_{t-2} + W_t \ \text{Model 2}: \;\;\; X_t &= 0.5 X_{t-1} + 0.25 X_{t-2} + W_t \ \text{Model 3}: \;\;\; X_t &= -1.5 X_{t-1} - 0.75 X_{t-2} + W_t. \end{aligned} ]

It is easy to verify that the first model has a unique (repeated) real root, the second has real distinct roots, while the last has complex roots. This is illustrated in Figure \@ref(fig:causalAR2part2) which depicts the causal region of an AR(2) separated between models with real and complex roots.

The ACFs of these models are represented in Figure \@ref(fig:ACFAR2eg). As expected, the three models correspond to ACFs that dampen exponentially fast, but only Model 3 exhibits a sinusoidal feature.
```
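The original figure code is not available in this draft; a minimal sketch (labelled `ACFAR2eg` only so that the cross-reference above resolves) that computes the theoretical ACFs with `ARMAacf()` is:

```{r ACFAR2eg, fig.cap = "Theoretical ACFs of the three AR(2) models"}
# Theoretical ACFs of the three AR(2) models via stats::ARMAacf
models = list("Model 1" = c(1, -0.25), "Model 2" = c(0.5, 0.25), "Model 3" = c(-1.5, -0.75))
par(mfrow = c(1, 3))
for (name in names(models)) {
  rho = ARMAacf(ar = models[[name]], lag.max = 20)
  plot(0:20, rho, type = "h", main = name, xlab = "Lag h", ylab = expression(rho(h)))
  abline(h = 0)
}
```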


Next, we consider the ACF for a general causal AR($p$) model. Unfortunately, this is far more complicated than our previous example.


```{example, label="arpacf", name = "Autocorrelation of an AR($p$)"}

Recall that the AR($p$) models can formally be represented as follows $${X_t} = {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$ where $W_t$ is a (Gaussian) white noise process with variance $\sigma^2$.

There are two ways to derive the ACF for general AR($p$) models. Considering the first approach, since we assume our AR($p$) model is causal, we can write it as a one-sided linear process: $X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j}$ which defines the autocovariance function as $\gamma(h) = cov(X_{t+h},X_t) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+h}$, for $h \geq 0$. The difficulty of this method is to solve for the $\psi$-weights. Treating the AR(p) as a special case of ARMA(p,q) models, we can let the MA polynomials $\theta(z) = 1$ and solve the $\psi$-weights by matching the coefficients in $\phi(z)\psi(z) = \theta(z)$.

Considering now the second method, we can also use the procedure we used for the AR(2) example. This consists in finding a homogeneous difference equation with respect to the ACF $\rho(h)$ and solving it directly. To do so we need the following steps.

Step 1: Find the homogeneous difference equation with respect to the ACF $\rho(h)$.

First, verify whether our model is causal. If it is, then we have \[\rho(h) - \phi_1\rho(h-1) - \cdots - \phi_p\rho(h-p) = 0, \mbox{ } h \geq p. \]

Step 2: Solve the roots of the associated AR polynomials.

The polynomial can be written as: \[\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p .\]

Let us suppose now that the polynomial has $r$ distinct roots and let $m_j$ denote the number of replicates of $z_j$, such that $m_1 +m_2 + \cdots +m_r = p$. Then the general solution of the homogeneous difference equation is \[\rho(h) = z_1^{-h}P_1(h) + z_2^{-h}P_2(h) + \cdots + z_r^{-h}P_r(h), \mbox{ } h \geq p,\] where $P_j(h)$ is a polynomial in $h$ of degree $m_j - 1$.

Step 3: Solve every $P_j(h)$ based on $p$ given initial conditions on $\rho(h)$.

Remark: Since the AR($p$) model is causal, the roots obtained in Step 2 should be outside the unit circle (i.e. $|z_i| > 1$, for $i = 1, \cdots, r$). Then the absolute value of the general solution is given by \[|\rho(h)| = \left|\frac{P_1(h)}{z_1^{h}} + \frac{P_2(h)}{z_2^{h}} + \cdots + \frac{P_r(h)}{z_r^{h}}\right| \leq \left|\frac{P_1(h)}{z_1^{h}}\right| + \left| \frac{P_2(h)}{z_2^{h}}\right| + \dots + \left|\frac{P_r(h)}{z_r^{h}}\right| \leq \frac{r \max_{j = 1, \dots, r} \left|P_j(h)\right|}{\min_{j = 1, \dots, r} \left|z_j \right|^{h}},\]

and, from the right hand side of the last inequality, we can find that the rate of convergence would be dominated by $\frac{1}{\min_{j = 1, \dots, r} \left|z_j \right|^{h}}$. Thus the ACF $\rho(h)$ will go to zero exponentially as $h \to \infty$.

EXAMPLE MISSING HERE 

The ACF can overall be extremely useful in detecting whether there appears to be an autocorrelation pattern in the data where, as seen above, an exponentially or sinusoidally decaying ACF would suggest the presence of an AR(p) process generating the time series. However, these patterns can appear very similar even for AR(p) models with considerably different orders (e.g. an AR(1) or an AR(3)). In this sense, the ACF has limited utility if there is a need to detect the order of the underlying AR(p) process. For this reason, the next section will discuss another representation of a time series which can be of more help in detecting, for example, the order of an AR(p) model.

#### Partial autocorrelation of AR(p) models

```{definition, label="pacf", name = "Partial autocorrelation"}
For a stationary process, $X_t$, the Partial AutoCorrelation Function (PACF)
can be denoted as $\phi_{hh}$, for $h = 1, 2, \dots,$ where
$$\phi_{11} = \corr(X_{t+1}, X_t) = \rho(1),$$
and 
$$\phi_{hh} = \corr(X_{t+h} - \hat{X}_{t+h}, X_t - \hat{X}_t), \mbox{ } h \geq 2.$$

Remark: From the above definition we can think of the partial autocorrelation $\phi_{hh}$ as the correlation between the residual of $X_{t+h}$ after removing its best linear predictor on the vector space spanned by $\{ X_{t+1}, \dots, X_{t+h-1} \}$ and the residual of $X_t$ after removing its best linear predictor on the same vector space. Similarly to linear regression, the residuals should be independent of the vector space spanned by $\{ X_{t+1}, \dots, X_{t+h-1} \}$ after projection.

We will discuss the best linear predictor in the section dedicated to forecasting, so here we will just mention how to obtain it.

```{definition, label="blp", name = "Best Linear Predictor (BLP)"} Given a time series $X_1, X_2, \dots, X_t$ with zero mean, the best linear predictor of $X_{t+h}$ can be written as $\hat{X}{t+h} = \sum{j=1}^t \alpha_j X_j$ such that it satisfies the prediction equations: [ \mathbb{E}(X_{t+h} - \hat{X}{t+h}) = 0, ] and [ \mathbb{E} [(X{t+h} - \hat{X}_{t+h})X_j ] = 0, \mbox{ for } i = 1, \dots, t.]

According to the projection theorem, we can show that satisfying the prediction equations is equivalent to minimizing the mean squared error $\mathbb{E}((X_{t+h} - \hat{X}_{t+h})^2)$.
Thus we have two equivalent ways to obtain the best linear predictor.

```{example, label="pacfar1", name = "PACF of an AR(1)"}
Consider a causal AR(1) model ${X_t} = \phi {X_{t - 1}} + {W_t}$. We have, 
\[\phi_{11} = \corr(X_{t+1}, X_t) = \rho(1) = \phi,\]
and 
\[\phi_{22} = \corr(X_{t+2} - \hat{X}_{t+2}, X_t - \hat{X}_t).\]
Next, we find $\hat{X}_t$ and $\hat{X}_{t+2}$. Since $X_{t+1}$ is the only
random variable between $X_t$ and $X_{t+2}$, $\hat{X}_t$ and $\hat{X}_{t+2}$ are the best linear predictors on the vector space spanned by $X_{t+1}$ and we can obtain them by minimizing the MSE:
\[\mathbb{E}(X_{t+2} - \hat{X}_{t+2})^2 = \mathbb{E}(X_{t+2} - \beta_1 X_{t+1})^2 = \gamma(0) - 2\beta_1 \gamma(1) + \beta_1^2 \gamma(0),\]
By minimizing the MSE, we have $\beta_1 = \frac{\gamma(1)}{\gamma(0)} = \phi$.

Similarly, by minimizing
\[\mathbb{E}(X_{t} - \hat{X}_{t})^2 = \mathbb{E}(X_{t} - \beta_2 X_{t+1})^2 = \gamma(0) - 2\beta_2 \gamma(1) + \beta_2^2 \gamma(0),\]
we have $\beta_2 = \frac{\gamma(1)}{\gamma(0)} = \phi$.

Equivalently we can use the prediction equations and thus we have
\[\mathbb{E}[(X_{t+2} - \hat{X}_{t+2})X_{t+1}] = \mathbb{E}[(X_{t+2}X_{t+1} - \beta_1 X_{t+1}^2)] = \gamma(1) - \beta_1 \gamma(0) = 0,\]
and
\[\mathbb{E}[(X_{t} - \hat{X}_{t})X_{t+1}] = \mathbb{E}[(X_{t}X_{t+1} - \beta_2 X_{t+1}^2)] = \gamma(1) - \beta_2 \gamma(0) = 0,\]
Consequently we can get the same solutions.

Therefore, \[\phi_{22} = \corr(X_{t+2} - \phi X_{t+1}, X_t - \phi X_{t+1}) = \corr(W_{t+2}, X_t - \phi X_{t+1}) = 0,\]
noting that the last equation is based on causality.

```{example, label="pacfarp", name = "PACF of AR(p)"} In this example we would like to show that the PACF characterizes the order of AR(p) models. Indeed, when $h > p$, the PACF of an AR(p) $\phi_{hh} = 0$. Suppose we have a causal AR(p) model, $X_{t+h} = \sum_{j=1}^p \phi_j X_{t+h-j} + W_{t+h}$ and we want to calculate $\phi_{hh}$.

The best linear predictor of $X_{t+h}$ is [\hat{X}{t+h} = \mathbb{E}\left[X{t+h}|X_t, \dots, X_{t+h-1}\right] = \mathbb{E}\left[\sum_{j=1}^p \phi_j X_{t+h-j} + W_{t+h}|X_t, \dots, X_{t+h-1}\right] = \sum_{j=1}^p \mathbb{E}\left[\phi_j X_{t+h-j}|X_t, \dots, X_{t+h-1}\right] = \sum_{j=1}^p \phi_j X_{t+h-j}.] Thus when $h > p$, [ \phi_{hh} = corr(X_{t+h} - \hat{X}{t+h}, X_t - \hat{X}_t) = corr(W{t+h}, X_t - \hat{X}_t) = 0. ]

With these properties in mind, it would be useful to have an estimator for the PACF just as we defined an estimator for the ACF in the previous chapter. If we re-express the above definitions of the PACF $\phi_{hh}$ using the linear predictors, taking from the definition given earlier and its subsequent remark, we can redefine the PACF $\phi_{hh}$ as being the coefficient $\beta_h$ of the linear predictor
\[\hat{X}_{t+h} = \sum_{j=1}^h \beta_j X_{t+h-j}\]
since $\beta_h$ represents the effect of $X_t$ on $X_{t+h}$ once the linear effect of the intermediate variables has been removed. Therefore, given this linear problem, we can obtain an estimate of the PACF ($\hat{\phi}_{hh}$) as the least squares solution for $\beta_h$ to the following regression problem
\[x_{t+h} = \sum_{j=1}^h \beta_j x_{t+h-j},\]
where $(x_t)_{t=h,...,T}$ is the observed time series. Adapting the asymptotics for the coefficients of a linear regression, we have that $\hat{\phi}_{hh}$ is asymptotically normally distributed and, if the process $(X_t)$ is a white noise, its variance is given by
\[\var(\hat{\phi}_{hh}) \approx \frac{1}{T}.\]
Hence, as for the ACF, the PACF can be tested for significance by building approximate confidence intervals using $\pm 2/\sqrt{T}$ as critical limits. Moreover, with such an estimator and confidence intervals, it would be possible to use the information from the estimated ACF and that of the PACF to determine the presence and eventual order of an AR(p) model underlying the observed time series.
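To illustrate this in practice (an added sketch using simulated data, not part of the original text), the sample PACF of a simulated AR(2) should be significant only at the first two lags, matching the theoretical PACF computed with `ARMAacf()`:

```r
# Sample PACF of a simulated AR(2) vs. the theoretical PACF
set.seed(5)
x = arima.sim(n = 500, model = list(ar = c(0.5, 0.25)))
pacf(x)                                               # bands approx. +/- 2/sqrt(T)
ARMAacf(ar = c(0.5, 0.25), lag.max = 5, pacf = TRUE)  # zero beyond lag p = 2
```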




### Estimation of AR(p) models

Given the above defined properties of the AR(p) models, we will now discuss
how these models can be estimated, more specifically how the p+1 parameters 
can be obtained from an observed time series. Indeed, a reliable estimation of 
these models is necessary in order to interpret and describe different natural 
phenomena and/or forecast possible future values of the time series.

A first approach builds upon the earlier representation of a causal AR(p) model as a linear process. Recall that
\begin{equation}
    X_t = \sum_{j = 1}^{p} \phi_j X_{t-j} + W_t
\end{equation}
which delivers the following autocovariance function
\begin{equation}
    \gamma(h) = cov(X_{t+h}, X_t) = cov(\sum_{j = 1}^{p} \phi_j X_{t+h-j} + W_{t+h}, X_t) = \sum_{j = 1}^{p} \phi_j \gamma(h-j), \mbox{ } h \geq 1.
\end{equation}
Rearranging the above expressions we obtain the following general equations
\begin{equation}
    \gamma(h) - \sum_{j = 1}^{p} \phi_j \gamma(h-j) = 0, \mbox{ } h \geq 1
\end{equation}
and, recalling that $\gamma(h) = \gamma(-h)$ and $\gamma(0) = cov(X_t, X_t)$,
\begin{equation}
    \gamma(0) - \sum_{j = 1}^{p} \phi_j \gamma(j) = \sigma_w^2.
\end{equation}
We can now define the Yule-Walker equations.

```{definition, label = "yweqs", name = "Yule-Walker Equations"}
The Yule-Walker equations are given by
\begin{equation}
    \gamma(h) = \phi_1 \gamma(h-1) + ... + \phi_p \gamma(h-p), \mbox{ } h = 1,...,p
\end{equation}
and
\begin{equation}
    \sigma_w^2 = \gamma(0) - \phi_1 \gamma(1) - ... - \phi_p \gamma(p).
\end{equation}
which in matrix notation can be defined as follows
\begin{equation}
    \Gamma_p \mathbf{\phi} = \mathbf{\gamma}_p \text{ and } \sigma_w^2 = \gamma(0) - \mathbf{\phi}'\mathbf{\gamma}_p
\end{equation}
where $\Gamma_p$ is the $p\times p$ matrix whose $\left(k,j\right)^{th}$ element is the autocovariance $\gamma(k-j)$, $j,k = 1, ...,p$, while $\mathbf{\phi} = (\phi_1,...,\phi_p)'$ and $\mathbf{\gamma}_p = (\gamma(1),...,\gamma(p))'$ are $p\times 1$ vectors.
```

Considering the Yule-Walker equations, it is possible to use a method of moments approach and simply replace the theoretical quantities given in the previous definition with their empirical (estimated) counterparts that we saw in the previous chapter. This gives us the following Yule-Walker estimators \begin{equation} \hat{\mathbf{\phi}} = \hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p \text{ and } \hat{\sigma}_w^2 = \hat{\gamma}(0) - \hat{\mathbf{\gamma}}_p'\hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p \end{equation}
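As an illustration of this method of moments approach (a sketch on simulated data added here, not part of the original text), the Yule-Walker estimates of an AR(2) can be computed directly from the sample autocovariances and compared with the built-in function `ar.yw()`:

```r
# Yule-Walker estimation of an AR(2) from the sample autocovariances
set.seed(1)
x = arima.sim(n = 500, model = list(ar = c(0.5, 0.25)))
g = acf(x, lag.max = 2, type = "covariance", plot = FALSE)$acf  # gamma(0), gamma(1), gamma(2)
Gamma_p = matrix(c(g[1], g[2], g[2], g[1]), 2, 2)   # Toeplitz matrix of autocovariances
gamma_p = g[2:3]
phi_hat = solve(Gamma_p, gamma_p)                   # estimates of (phi_1, phi_2)
sigma2_hat = g[1] - sum(phi_hat * gamma_p)          # estimate of the innovation variance
c(phi_hat, sigma2_hat)

# Compare with the built-in implementation
ar.yw(x, order.max = 2, aic = FALSE)$ar
```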

These estimators have the following asymptotic properties.

**Property (Consistency and Asymptotic Normality of Yule-Walker Estimators):** The Yule-Walker estimators for a causal AR(p) model have the following asymptotic properties:

\begin{equation} \sqrt{T}(\hat{\mathbf{\phi}}- \mathbf{\phi}) \xrightarrow{\mathcal{D}} \mathcal{N}(\mathbf{0},\sigma_w^2\Gamma_p^{-1}) \text{ and } \hat{\sigma}_w^2 \xrightarrow{\mathcal{P}} \sigma_w^2 . \end{equation}

Therefore the Yule-Walker estimators have an asymptotically normal distribution and the estimator of the innovation variance is consistent. Moreover, these estimators are also optimal for AR(p) models, meaning that they are also efficient. However, there exists another method which allows us to achieve this efficiency also for general ARMA models: the maximum likelihood method.

Considering an AR(1) model as an example, and assuming without loss of generality that the expectation is zero, we have the following representation of the AR(1) model
\begin{equation}
X_t = \phi X_{t-1} + W_t,
\end{equation}
where $|\phi|<1$ and $W_t \overset{iid}{\sim} \mathcal{N}(0,\sigma_w^2)$. Supposing we have observations $(x_t)_{t=1,...,T}$ issued from this model, the likelihood function for this setting is given by
\begin{equation}
L(\phi,\sigma_w^2) = f(\phi,\sigma_w^2|x_1,...,x_T),
\end{equation}
which, for an AR(1) model, can be rewritten as follows
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)f(x_2|x_1)\cdot \cdot \cdot f(x_T|x_{T-1}).
\end{equation}
If we define $\Omega_t^p$ as the information contained in the $p$ observations previous to time $t$, the above can be generalized for an AR(p) model as follows
\begin{equation}
L(\mathbf{\phi},\sigma_w^2) = f(x_1,...,x_p)f(x_{p+1}|\Omega_{p+1}^p)\cdot \cdot \cdot f(x_T|\Omega_{T}^p),
\end{equation}
where $f(x_1,...,x_p)$ is the joint probability distribution of the first $p$ observations. A discussion on how to find $f(x_1,...,x_p)$ will be presented in the following paragraphs based on the approach used to find $f(x_1)$ in the AR(1) likelihood. Going back to the latter, we know that $x_t | x_{t-1} \sim \mathcal{N}(\phi x_{t-1},\sigma_w^2)$ and therefore we have that
\begin{equation}
f(x_t|x_{t-1}) = f_w(x_t - \phi x_{t-1}),
\end{equation}
where $f_w(\cdot)$ is the distribution of $W_t$. This rearranges the likelihood function as follows
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)\prod_{t=2}^T f_w(x_t - \phi x_{t-1}),
\end{equation}
where $f(x_1)$ can be found through the causal representation
\begin{equation}
x_1 = \sum_{j=0}^{\infty} \phi^j w_{1-j},
\end{equation}
which implies that $x_1$ follows a normal distribution with zero expectation and a variance given by $\frac{\sigma_w^2}{(1-\phi^2)}$. Based on this, the likelihood function of an AR(1) finally becomes
\begin{equation}
L(\phi,\sigma_w^2) = (2\pi \sigma_w^2)^{-\frac{T}{2}} (1 - \phi^2)^{\frac{1}{2}} \exp \left(-\frac{S(\phi)}{2 \sigma_w^2}\right),
\end{equation}
with $S(\phi) = (1-\phi^2) x_1^2 + \sum_{t=2}^T (x_t -\phi x_{t-1})^2$. Once the logarithm of the likelihood is taken, the minimization of its negative is usually carried out numerically. However, if we condition on the initial values, the AR(p) models are linear and, for example, we can then define the conditional likelihood of an AR(1) as
\begin{equation}
L(\phi,\sigma_w^2|x_1) = (2\pi \sigma_w^2)^{-\frac{T-1}{2}} \exp \left(-\frac{S_c(\phi)}{2 \sigma_w^2}\right),
\end{equation}
where
\begin{equation}
S_c(\phi) = \sum_{t=2}^T (x_t -\phi x_{t-1})^2 .
\end{equation}
The latter is called the conditional sum of squares and $\phi$ can be estimated as a straightforward linear regression problem. Once an estimate $\hat{\phi}$ is obtained, this can be used to obtain the conditional maximum likelihood estimate of $\sigma_w^2$:
\begin{equation}
\hat{\sigma}_w^2 = \frac{S_c(\hat{\phi})}{T-1} .
\end{equation}
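To make the last point concrete (a sketch on simulated data, not taken from the original text), the conditional sum of squares estimate of an AR(1) can be obtained with a simple regression of $x_t$ on $x_{t-1}$:

```r
# Conditional least squares for an AR(1) as a linear regression of x_t on x_{t-1}
set.seed(42)
x = arima.sim(n = 200, model = list(ar = 0.7))
fit = lm(x[-1] ~ x[-length(x)] - 1)               # no intercept: zero-mean process
phi_hat = unname(coef(fit))                       # conditional (CSS) estimate of phi
sigma2_hat = sum(residuals(fit)^2)/(length(x) - 1)
c(phi = phi_hat, sigma2 = sigma2_hat)

# Compare with arima() using the conditional sum of squares method
arima(x, order = c(1, 0, 0), include.mean = FALSE, method = "CSS")
```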

The estimation methods presented so far are the standard ones for this kind of model. Nevertheless, if the data suffer from some form of contamination, these methods can become highly biased. For this reason, some robust estimators are available to limit this problem if there are indeed outliers in the observed time series. A first solution is given by the estimator proposed in Kunsch (1984), who underlines that the MLE score function of an AR(p) is given by
\begin{equation}
\kappa(\mathbf{\theta}|x_j,...,x_{j+p}) = \frac{\partial}{\partial \mathbf{\theta}} \left(x_{j+p} - \sum_{k=1}^p \phi_k x_{j+p-k}\right)^2,
\end{equation}
where $\mathbf{\theta}$ is the parameter vector containing, in the case of an AR(1) model, the two parameters $\phi$ and $\sigma_w^2$ (i.e. $\mathbf{\theta} = [\phi \,\, \sigma_w^2]$). This delivers the estimating equation
\begin{equation}
\sum_{j=1}^{n-p} \kappa (\hat{\mathbf{\theta}}|x_j,...,x_{j+p}) = 0 .
\end{equation}
The score function $\kappa(\cdot)$ is clearly not bounded, in the sense that if we arbitrarily move a value of $(x_t)$ towards infinity then the score function also goes to infinity, thereby delivering a biased estimation procedure. To avoid outlying observations biasing the estimation excessively, a bounded score function can be used to deliver an M-estimator given by
\begin{equation}
\sum_{j=1}^{n-p} \psi (\hat{\mathbf{\theta}}|x_j,...,x_{j+p}) = 0,
\end{equation}
where $\psi(\cdot)$ is a bounded function. When conditioning on the first $p$ observations, this problem can be brought back to a linear regression problem which can be solved in a robust manner using the robust regression tools available in R such as rlm(...) or lmrob(...). However, another tool available in R which can be applied directly (without conditioning) also for general ARMA models is the gmwm(...) function with the option robust = TRUE. This function makes use of a quantity called the wavelet variance (denoted as $\nu$) which is estimated robustly and then used to retrieve the parameters $\mathbf{\theta}$ of the time series model. The robust estimate is obtained by solving the following minimization problem
\begin{equation}
\hat{\theta} = \underset{\theta \in \Theta}{\argmin} \, (\hat{\nu} - \nu(\theta))^T\Omega(\hat{\nu} - \nu(\theta)),
\end{equation}
where $\hat{\nu}$ is the robustly estimated wavelet variance, $\nu(\theta)$ is the theoretical wavelet variance implied by the model and $\Omega$ is a positive definite weighting matrix. Below we show some simulation studies where we present the results of the above estimation procedures in the absence and in the presence of contamination in the data.

In particular, we simulate three different processes $X_t, Y_t, Z_t$ from \[X_t = 0.5 X_{t-1} + 0.25 X_{t-2} + W_t,\] with $W_t \sim WN(0, 1)$. Within two of the processes, we apply either a process-wise contamination or a point-wise contamination. The process-wise contamination, $Y_t$, has a portion of the original process $X_t$ replaced with concurrent samples from \[U_t = 0.90 U_{t-1} - 0.40 U_{t-2} + V_t,\] where $V_t \sim WN(0, 9)$, whereas the point-wise contamination, $Z_t$, has points selected at random from $X_t$ and replaced with $N_t \sim WN(0, 9)$.

``` {r simuAR2data, cache = TRUE}
n = 1000                     # Sample Size
gamma = 0.05                 # Level of contamination
dispersal = round(gamma*n)   # Amount of samples contaminated

set.seed(19)                 # Set seed for reproducibility

# Generate data!
Xt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
Yt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
Zt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))

# Contaminate a portion of Y_t with a process
Yt[101:135,] = gen_gts(35, AR(phi = c(0.9, -0.4), sigma2 = 9))

# Contaminate at random Z_t with noise
Zt[sample(n, dispersal, replace = FALSE),] = gen_gts(dispersal, WN(sigma2 = 9))
```

```r
# Graph points
plot1 = autoplot(Xt) + ylab(expression(X[t]))
plot2 = autoplot(Yt) + ylab(expression(Y[t]))
plot3 = autoplot(Zt) + ylab(expression(Z[t]))
grid.arrange(plot1, plot2, plot3, nrow = 3)
```

As every process will require the same estimation methodology, we will first define a computational engine that will compute estimates for the parameters across all of the different methods under the simulation study. Doing so provides separate, reusable code that reduces the chance of error compared to having multiple copies of the same code scattered about. In addition, we have defined a means to visualize the results obtained from the simulation study through a boxplot. The procedure for creating the graphs is detailed after the simulation engine.

```r
sim_engine = function(X_t){
  # Estimate ARIMA parameters using MLE
  mod = arima(X_t, order = c(2, 0, 0), include.mean = FALSE)

  # Extract MLE Parameters (including sigma2)
  res.MLE = c(mod$coef, mod$sigma2)

  # Calculate ACF
  autocorr = as.numeric(acf(X_t, lag.max = 2, plot = FALSE)$acf)
  X = matrix(1,2,2)
  X[1,2] = X[2,1] = autocorr[2]
  y = autocorr[2:3]

  # Compute Least Squares on ACF
  svmat = solve(X)
  phi.LS = svmat%*%y
  sig2.LS = var(X_t)*(1 - t(y)%*%svmat%*%y)
  res.LS = c(phi.LS, sig2.LS)

  # Calculate RACF
  rob.ccf = as.numeric(robacf(X_t, plot=FALSE, type = "covariance")$acf)
  X[1,2] = X[2,1] = rob.ccf[2]/rob.ccf[1]
  y = rob.ccf[2:3]/rob.ccf[1]

  # Compute Least Squares on RACF
  svmat = solve(X)
  phi.RR = svmat%*%y
  sig2.RR = rob.ccf[1]*(1 - t(y)%*%svmat%*%y)
  res.RR = c(phi.RR, sig2.RR)

  # Compute the GMWM Estimator
  res.GMWM = gmwm(ARMA(2,0), X_t)$estimate
  res.RGMWM = gmwm(ARMA(2,0), X_t, robust = TRUE)$estimate

  # Return results
  list(res.MLE = res.MLE, res.LS = res.LS, res.RR = res.RR,
       res.GMWM = res.GMWM, res.RGMWM = res.RGMWM)
}

sim_study_graph = function(res.MLE, res.LS, res.RR, res.GMWM, res.RGMWM,
                           Truth = c(0.5, 0.25, 1)){
  # Converts Simulation Matrices to Long Form for use in ggplot2
  d = balamuta::study_df(res.MLE, res.LS, res.RR, res.GMWM, res.RGMWM,
                         data_names = c("MLE","LS","RLS","GMWM","RGMWM"))

  # Reorder factors
  d$Draw = factor(d$Draw, levels = c(1,2,3),
                  labels = expression(phi[1], phi[2], sigma^2))
  d$Type = factor(d$Type, levels = c("MLE","LS","RLS","GMWM","RGMWM"))

  # Add in horizontal lines
  oracle = data.frame(Draw = 1:3, Truth = Truth)
  oracle$Draw = factor(oracle$Draw, levels = c(1,2,3),
                       labels = expression(phi[1], phi[2], sigma^2))

  # Plot
  ggplot(d, aes(x = Type, y = Values, fill = Type)) +
    geom_boxplot() +
    stat_boxplot(geom ='errorbar', width = 0.5) +
    geom_hline(data = oracle, aes(yintercept = Truth), color = "red") +
    facet_wrap(~Draw, labeller = label_parsed, scales = "free_y") +
    theme_bw() +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 50, hjust = 1))
}
```

With the means to now compute and visualize parameter estimates for each of the given methods, the bootstrap simulation study may now commence. Using the simulation engine, we can simultaneously evaluate the processes within one loop iteration. Please take into consideration that the following simulation study is computationally intensive and may require the number of iterations $B$ to be decreased on a personal computer in order to complete in a reasonable time.

``` {r simuAR2study, cache = TRUE, fig.height = 4.5, fig.width = 9}
# Number of bootstrap iterations
B = 250

# Simulation storage
res.xt.MLE = res.xt.LS = res.xt.RR = res.xt.GMWM = res.xt.RGMWM = matrix(NA, B, 3)
res.yt.MLE = res.yt.LS = res.yt.RR = res.yt.GMWM = res.yt.RGMWM = matrix(NA, B, 3)
res.zt.MLE = res.zt.LS = res.zt.RR = res.zt.GMWM = res.zt.RGMWM = matrix(NA, B, 3)

# Begin bootstrap
for (i in seq_len(B)){

  # Set seed for reproducibility
  set.seed(i)

  # Generate processes
  Xt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
  Yt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
  Zt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))

  # Generate Ut contamination process that replaces a portion of original signal
  Yt[101:135] = gen_gts(35, AR(phi = c(0.9, -0.4), sigma2 = 9))

  # Generate Nt contamination that injects noise at random
  Zt[sample(n, dispersal, replace = FALSE)] = gen_gts(dispersal, WN(sigma2 = 9))

  # Compute estimates and store in the appropriate matrix
  res = sim_engine(Xt)
  res.xt.MLE[i,] = res$res.MLE; res.xt.LS[i,] = res$res.LS; res.xt.RR[i,] = res$res.RR
  res.xt.GMWM[i,] = res$res.GMWM; res.xt.RGMWM[i,] = res$res.RGMWM

  res = sim_engine(Yt)
  res.yt.MLE[i,] = res$res.MLE; res.yt.LS[i,] = res$res.LS; res.yt.RR[i,] = res$res.RR
  res.yt.GMWM[i,] = res$res.GMWM; res.yt.RGMWM[i,] = res$res.RGMWM

  res = sim_engine(Zt)
  res.zt.MLE[i,] = res$res.MLE; res.zt.LS[i,] = res$res.LS; res.zt.RR[i,] = res$res.RR
  res.zt.GMWM[i,] = res$res.GMWM; res.zt.RGMWM[i,] = res$res.RGMWM
}
```

```r
sim_study_graph(res.xt.MLE, res.xt.LS, res.xt.RR, res.xt.GMWM, res.xt.RGMWM) 
sim_study_graph(res.yt.MLE, res.yt.LS, res.yt.RR, res.yt.GMWM, res.yt.RGMWM)
sim_study_graph(res.zt.MLE, res.zt.LS, res.zt.RR, res.zt.GMWM, res.zt.RGMWM)
```

After performing the estimation, we opt to visually assess how well each methodology performed. We begin by observing the baseline case of the uncontaminated sample $X_t$ given in Figure \@ref(fig:simuAR2XtResults). Within this context, the estimation methods appear to perform equally well across all parameters. Note, it may appear as if $\phi_2$ had a noticeable departure from the truth; however, the difference presented is minor considering the y-axis of the graph and the parameter recovery present elsewhere. Turning our attention to Figure \@ref(fig:simuAR2YtResults), which displays results from estimating $Y_t$, the process with partial contamination from another process, the robust methodologies (RLS, RGMWM) were able to effectively handle the contamination whereas the classical approaches (LS, MLE, GMWM) missed the mark greatly on several estimates. From the simulations, there is evidence to suggest that additional study should be performed to assess the performance of RLS vs. RGMWM. Lastly, we consider the results of estimating $Z_t$, the case of random noise injected into the process, given in Figure \@ref(fig:simAR2ZtResults). The noise has a troubling impact on the estimation of $\phi_1$ and $\sigma^2$. However, the estimation procedures appear to be able to reasonably capture $\phi_2$. As shown previously, the robust techniques provide better estimates than their classical counterparts.

For all the above methods, it is necessary to understand how "precise" their estimates are. To do so, we need to obtain confidence intervals for these estimates, which can be done mainly in two manners:

The first approach consists in using the asymptotic distribution of the estimators presented earlier to deliver approximations of the confidence intervals which improve as the length of the observed time series increases. Hence, for example, if we wanted to find a 95% confidence interval for the parameter $\phi$, we would use the quantiles of the normal distribution (given that all the methods presented earlier share this asymptotic distribution). However, this approach can present some drawbacks, one of which is its behaviour when the parameters are close to the boundaries of the parameter space. Suppose we consider a realization of length $n = 100$ of the following AR(1) model:

\[X_t = 0.96 X_{t-1} + W_t, \;\;\;\; W_t \sim \mathcal{N}(0,1),\]

which is depicted in Figure \@ref(fig:simAR1ci).

``` {r simAR1ci, cache = TRUE, fig.height = 4.5, fig.width = 9, fig.cap = "AR(1) with $\\phi$ close to parameter bound"}
set.seed(7)
x = gen_gts(n = 100, AR1(0.96, 1))
autoplot(x)
```

It can be seen that the parameter $\phi = 0.96$ respects the condition for 
stationarity, $\left| \phi \right| < 1$, but is very close to its boundary.
Using the MLE and the GMWM estimators, we first estimate the parameters and
then compute confidence intervals for $\phi$ using the asymptotic 
normal distribution.

```r
# Compute the parameter estimates using MLE
fit.ML = arima(x, order = c(1,0,0), include.mean = FALSE)
c("phi" = fit.ML$coef, "se" = sqrt(fit.ML$var.coef))

# Construct asymptotic confidence interval for phi
fit.ML$coef + c(-1,1)*1.96*sqrt(fit.ML$var.coef)

# Compute the parameter estimates with inference using GMWM
fit.gmwm = gmwm(AR1(), x)
summary(fit.gmwm, inference = TRUE)$estimate
```

From the estimation summary, we note that both confidence intervals contain values that would make the AR(1) non-stationary (i.e. values of $\phi$ larger than 1). This can also be seen by examining the estimated (asymptotic) distributions of $\hat{\phi}$ displayed in Figure \@ref(fig:asymIC).

```{r asymIC, cache = TRUE, echo = FALSE, fig.height = 4.5, fig.width = 9, fig.cap = "Estimated asymptotic distributions of $\\hat{\\phi}$ for MLE and GMWM parameter estimates. The dashed vertical line represents the true value of $\\phi$, the solid line denotes the upper bound of the parameter space for $\\phi$ and the vertical ticks represent the limits of the 95% confidence intervals for both methods."}
# Define colors
gg_color_hue = function(n, alpha = 1) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}

colors = gg_color_hue(6, alpha = 0.2)
colors2 = gg_color_hue(6, alpha = 1)
phi.sd.gmwm = summary(fit.gmwm)$estimate[,"SE"][1]

par(mfrow = c(1,2))
xx = seq(from = 0, to = 1.2, length.out = 10^3)
yy.ML = dnorm(xx, fit.ML$coef, sqrt(fit.ML$var.coef))
plot(NA, xlim = c(0.8, 1.04), ylim = range(yy.ML), xlab = expression(phi),
     ylab = "Density", main = "MLE")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
polygon(c(xx, rev(xx)), c(rep(0, length(yy.ML)), rev(yy.ML)), border = NA, col = colors[1])
points(qnorm(0.975, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")
points(qnorm(0.025, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")

yy.gmwm = dnorm(xx, fit.gmwm$estimate[1], phi.sd.gmwm)
plot(NA, xlim = c(0.7, 1.2), ylim = range(yy.gmwm), xlab = expression(phi),
     ylab = "Density", main = "GMWM")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
polygon(c(xx, rev(xx)), c(rep(0, length(yy.gmwm)), rev(yy.gmwm)), border = NA, col = colors[2])
points(qnorm(0.975, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
points(qnorm(0.025, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
```

For this reason, the asymptotic confidence intervals are not realistic. One viable
solution is to employ a parametric bootstrap, detailed in Theorem \@ref(thm:parabootstrap).
Indeed, the parametric bootstrap takes the estimated parameter values and uses them
to simulate from an AR(1) process. For each simulation, the parameters
are estimated again and stored. Finally, the empirical quantiles at $\alpha/2$ and $1-\alpha/2$
of the saved parameter estimates are taken to obtain a confidence interval
which does not suffer from boundary problems. The code below gives an example of
how this confidence interval is built based on the same estimation procedure but
using a parametric bootstrap (with $B = 10000$ bootstrap replicates).

``` {r paraAR1ciest, cache = TRUE, warning = FALSE, fig.height= 6, fig.width= 9}
# Number of Iterations
B = 10000

# Set up storage for results
est.phi.gmwm = rep(NA,B)
est.phi.ML = rep(NA,B)

# Model generation statements
model.gmwm = AR1(fit.gmwm$estimate[1], fit.gmwm$estimate[2])
model.mle = AR1(fit.ML$coef, fit.ML$sigma2)

# Begin bootstrap
for(i in seq_len(B)){

  # Set seed for reproducibility
  set.seed(B + i)

  # Generate process under MLE parameter estimate
  x.star = gen_gts(100, model.mle)

  # Attempt to estimate phi by employing a try
  est.phi.ML[i] = tryCatch(arima(x.star, order = c(1,0,0), include.mean = FALSE)$coef, 
                           error = function(e) NA)

  # Generate process under GMWM parameter estimate
  x.star = gen_gts(100, model.gmwm)
  est.phi.gmwm[i] = gmwm(model.gmwm, x.star)$estimate[1]
}
```

``` {r paraAR1cigraphs, cache = TRUE, warning = FALSE, echo = FALSE, fig.height = 6, fig.width = 9, fig.cap = "Estimated distributions of $\\hat{\\phi}$ for MLE and GMWM parameter estimates. The bar plots represent the results from the parametric bootstrap, the densities represent the estimated asymptotic distributions, the dashed vertical line represents the true value of $\\phi$, the solid line denotes the upper bound of the parameter space for $\\phi$ and the vertical ticks represent the limits of the 95% confidence intervals under both approaches for the estimation methods."}
par(mfrow = c(1,2))

# MLE: parametric bootstrap histogram vs. asymptotic density
hist.phi.ML = hist(est.phi.ML, plot = FALSE, breaks = 20)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]
plot(NA, xlim = c(0.65, 1.04), ylim = range(yy.ML), xlab = expression(phi),
     ylab = "Density", main = "MLE")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.ML, xlim = c(0.6, 1.1), ylim = range(yy.ML), add = TRUE, col = colors[3], border = "grey60")
polygon(c(xx, rev(xx)), c(rep(0, length(yy.ML)), rev(yy.ML)), border = NA, col = colors[1])

points(qnorm(0.975, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")
points(qnorm(0.025, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")

points(quantile(est.phi.ML, 0.975, na.rm = TRUE), 0, cex = 3, col = colors2[3], pch = "|")
points(quantile(est.phi.ML, 0.025, na.rm = TRUE), 0, cex = 3, col = colors2[3], pch = "|")

# GMWM: parametric bootstrap histogram vs. asymptotic density
hist.phi.GMWM = hist(na.omit(est.phi.gmwm), plot = FALSE, breaks = 20)
hist.phi.GMWM$counts = hist.phi.GMWM$counts/sum(hist.phi.GMWM$counts)/diff(hist.phi.GMWM$mids)[1]
plot(NA, xlim = c(0.53, 1.2), ylim = range(hist.phi.GMWM$counts), xlab = expression(phi),
     ylab = "Density", main = "GMWM")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.GMWM, xlim = c(0.53, 1.2), ylim = range(hist.phi.GMWM$counts), add = TRUE, col = colors[5], border = "grey60")
polygon(c(xx, rev(xx)), c(rep(0, length(yy.gmwm)), rev(yy.gmwm)), border = NA, col = colors[2])

points(qnorm(0.975, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
points(qnorm(0.025, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")

points(quantile(est.phi.gmwm, 0.975, na.rm = TRUE), 0, cex = 3, col = colors2[5], pch = "|")
points(quantile(est.phi.gmwm, 0.025, na.rm = TRUE), 0, cex = 3, col = colors2[5], pch = "|")
```

In Figure \@ref(fig:paraAR1cigraphs), we compare the estimated densities
of $\hat{\phi}$ using asymptotic results and bootstrap techniques for the MLE
and the GMWM estimator. It can be observed that the issue of unrealistic
confidence intervals that arose previously has been resolved, as the interval regions
now lie entirely within the boundaries of the parameter space.

To emphasize the effectiveness of the parametric bootstrap, let us
consider one additional example based on an AR(2) process. For this example, the
discussion will focus solely on the behavior exhibited by the MLE. The AR(2)
process is defined as:

\begin{equation}
{X_t} = {1.98}{X_{t - 1}} - {0.99}{X_{t - 2}} + {W_t}
(\#eq:ciar2)
\end{equation}

where $W_t \sim WN(0, 1)$. The generated process is displayed in Figure \@ref(fig:CIAR2data). 

```{r CIAR2data, cache = TRUE, fig.height = 4.5, fig.width = 9, fig.cap = "Simulated AR(2) process with $\\phi_1 = 1.98$ and $\\phi_2 = -0.99$."}
set.seed(432)
Xt = gen_gts(500, AR(phi = c(1.98, -0.99), sigma2 = 1))
autoplot(Xt)
mod = arima(Xt, c(2,0,0), include.mean = FALSE)
mod
```

With the model's coefficients readily available, the parametric bootstrap can be constructed. As before, the bootstrap implementation mirrors prior versions, with one caveat: $\phi$ is now two dimensional and, thus, requires a matrix to store the results instead of a traditional atomic vector in R.

```r
B = 10000
est.phi.ML = matrix(NA, B, 2)
model = AR(phi = mod$coef, sigma2 = mod$sigma2)

for(i in seq_len(B)) {
  set.seed(B + i)              # Set seed for reproducibility

  x.star = gen_gts(500, model) # Simulate the process 

  # Obtain parameter estimate with protection
  # that defaults to NA if unable to estimate
  est.phi.ML[i,] = tryCatch(arima(x.star, order = c(2,0,0),
                                  include.mean = FALSE)$coef,
                            error = function(e) NA)
}
```

```{r, cache = TRUE, echo = FALSE, fig.height = 6, fig.width = 9, fig.cap = "Estimated distributions of $\\hat{\\phi_1}$ and $\\hat{\\phi_2}$ of the MLE using asymptotic and parametric bootstrap techniques. The colored contours represent the density of the distribution and the dark grey lines represent the boundary constraints of $\\left|\\phi_2\\right|<1$ and $\\phi_2 = 1 - \\phi_1$."}
# Graphing requires use of MASS and RColorBrewer
library("MASS")
library("RColorBrewer")

est.phi.ML = na.omit(est.phi.ML)
z = kde2d(est.phi.ML[,1], est.phi.ML[,2], n = 50)
k = 11
my.cols = rev(brewer.pal(k, "RdYlBu"))

bivn = mvrnorm(100000, mu = mod$coef, Sigma = matrix(c(mod$var.coef), 2))
bivn.kde = kde2d(bivn[,1], bivn[,2], n = 50)

par(mfrow = c(1,2))
plot(NA, xlim = c(1.96, 2), ylim = c(-1.02, -0.97), xlab = expression(phi[1]),
     ylab = expression(phi[2]), cex.lab = 1.5, main = "Asymptotic")
grid()

# Adding boundary of constraint |phi_2| < 1
abline(h = c(-1, 1), lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 - phi_1
phi1 = seq(from = -2, to = 2, length.out = 10^3)
phi2.c1 = 1 - phi1
lines(phi1, phi2.c1, lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 + phi_1
phi2.c2 = 1 + phi1
lines(phi1, phi2.c2, lty = 2, col = "darkgrey")
contour(bivn.kde, drawlabels = FALSE, nlevels = k, col = my.cols, add = TRUE)

plot(NA, xlim = c(1.96, 2), ylim = c(-1.02, -0.97), xlab = expression(phi[1]),
     ylab = expression(phi[2]), cex.lab = 1.5, main = "Bootstrap")
grid()

# Adding boundary of constraint |phi_2| < 1
abline(h = c(-1, 1), lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 - phi_1
lines(phi1, phi2.c1, lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 + phi_1
lines(phi1, phi2.c2, lty = 2, col = "darkgrey")
contour(z, drawlabels = FALSE, nlevels = k, col = my.cols, add = TRUE)
```

The cost of this approach is the assumption that the model captures the dependency
structure present within the time series. That is to say, we assume we are able to
regenerate new time series that follow the appropriate distribution at each
sampling phase. However, if we are not confident that the selected model is a valid
approximation of the truth, then using a parametric bootstrap based on the estimated
model is highly problematic, as it does not represent the dependency between
observations. To complicate matters further, the traditional bootstrap, which
consists of simple random sampling with replacement, would also be highly inaccurate
(indeed, downright reckless to use) in the presence of dependency. Therefore, to
preserve the dependency structure of the original data one would have to use
_block bootstrapping_, which is a form of non-parametric bootstrap. There are many
different kinds of block bootstraps for time series, which are descendants of the
Moving Block Bootstrap (MBB) presented in Theorem \@ref(thm:mbb).

```{theorem, label="mbb", name="Moving Block Bootstrap"}
Suppose that $\left(X_t\right)$ is a weakly stationary time series with $N$ 
observations. 

1. Divide the time series into overlapping blocks $\left\{S_1, \ldots, S_M\right\}$ of
   length $\ell$, $1 \le \ell < N$, resulting in $M = N - \ell + 1$ blocks structured as: 

$$\begin{aligned}
  S_1 &= (X_1, X_2, \ldots, X_{\ell}) \\
  S_2 &= (X_2, X_3, \ldots, X_{\ell + 1}) \\
      &\;\;\vdots \\
  S_M &= (X_{N-\ell+1}, X_{N-\ell+2}, \ldots, X_{N})
\end{aligned}$$

2. Draw $\left\lfloor {N/\ell} \right\rfloor$ blocks with replacement
   from $\left\{S_1, \ldots, S_M\right\}$ and place them in order to form
   $(X_t^*)$. 
3. Compute the statistic of interest on the simulated 
   sample $(X_t^*)$.
4. Repeat Steps 2 and 3 $B$ times where $B$ is sufficiently "large" 
   (typically $100 \leq B \leq 10000$).
5. Compute the empirical mean and variance of the statistic of interest based on 
   the $B$ independent replications. 
```
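
To make the block construction of Theorem \@ref(thm:mbb) concrete, here is a minimal sketch of an MBB resampler for the AR(1) coefficient; the function name `ar1_mbb` and its default arguments are illustrative choices of ours rather than part of any package.

```r
# Minimal sketch of the Moving Block Bootstrap applied to the AR(1) coefficient.
# The function `ar1_mbb` is illustrative and not part of a package.
ar1_mbb = function(Xt, block_len = 10, B = 500) {

  n = length(Xt)                 # Length of the time series
  n_blocks = n - block_len + 1   # Number of overlapping blocks S_1, ..., S_M
  n_draws = floor(n/block_len)   # Blocks drawn to rebuild a series of about length n
  res = rep(NA, B)               # Bootstrapped statistics

  for (i in seq_len(B)) {

    set.seed(i + 5882)           # Seed for reproducibility (arbitrary offset)

    # Draw block starting points with replacement (overlap allowed)
    starts = sample(n_blocks, n_draws, replace = TRUE)

    # Concatenate the selected blocks to form X*
    x_star = unlist(lapply(starts, function(s) Xt[s:(s + block_len - 1)]))

    # Estimate phi on the bootstrapped series, guarding against failures
    res[i] = tryCatch(arima(x_star, order = c(1, 0, 0), include.mean = FALSE)$coef,
                      error = function(e) NA)
  }

  na.omit(res)                   # Return the bootstrapped estimates
}
```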

The approach taken by the MBB ensures that the dependency between observations is preserved within each block. However, one particular issue that arises is that some inaccuracy is introduced because successive blocks are potentially independent from each other. In reality, this is one of the trade-offs of the MBB approach, and it can be mitigated by selecting an optimal block length $\ell$; choosing $\ell = N^{1/3}$ as $N \to \infty$ lowers the inaccuracy of the procedure (the proof is left as an exercise to the reader). An earlier variant of the MBB is the nonoverlapping block bootstrap (NBB), which prohibits the sharing of data between blocks and is described in Theorem \@ref(thm:nbb).

```{theorem, label="nbb", name="Nonoverlapping Block Bootstrap"}
Suppose that $\left(X_t\right)$ is a weakly stationary time series with $N$ observations.

1. Divide the time series into nonoverlapping blocks $\left\{S_1, \ldots, S_M\right\}$ of length $\ell$, $1 \le \ell < N$, resulting in $M = \left\lfloor {\frac{N}{\ell}} \right\rfloor$ blocks structured as:

$$\begin{aligned}
  S_1 &= (X_1, X_2, \ldots, X_{\ell}) \\
  S_2 &= (X_{\ell+1}, X_{\ell+2}, \ldots, X_{2\ell}) \\
      &\;\;\vdots \\
  S_M &= (X_{(M-1)\ell+1}, X_{(M-1)\ell+2}, \ldots, X_{M\ell})
\end{aligned}$$

2. Draw $M$ blocks with replacement from $\left\{S_1, \ldots, S_M\right\}$ and place them in order to form $(X_t^*)$.
3. Compute the statistic of interest on the simulated sample $(X_t^*)$.
4. Repeat Steps 2 and 3 $B$ times where $B$ is sufficiently "large" (typically $100 \leq B \leq 10000$).
5. Compute the empirical mean and variance of the statistic of interest based on the $B$ independent replications.
```

Alternatively, depending on the case, one can also use modifications of the MBB that
change how the beginning and end of the time series are weighted,
such as the circular block bootstrap (CBB), which improves the treatment of the
dependency structure by wrapping observations around the end of the series (see the
sketch below), or the stationary bootstrap (SB), which randomizes the block length
using a geometric distribution with mean $\ell$. Regardless, the outcome of
using the MBB on time series is considerably better than simple resampling, which
corresponds to setting $\ell = 1$ since the length of $(X_t^*)$ is $M\times\ell$.
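
As a sketch of the wrapping idea behind the CBB, the helper below (our own illustration, not a packaged function) builds block indices with modular arithmetic so that a block starting near the end of the series wraps around to its beginning:

```r
# Sketch of circular block sampling: indices wrap around the end of the series.
# `cbb_indices` is an illustrative helper, not part of a package.
cbb_indices = function(n, block_len) {
  n_draws = floor(n/block_len)
  starts = sample(n, n_draws, replace = TRUE)                     # Any point can start a block
  idx = lapply(starts, function(s) ((s - 1 + 0:(block_len - 1)) %% n) + 1)
  unlist(idx)                                                     # Indices of X used to build X*
}

# Example: indices used to rebuild a bootstrap series of length close to 100
set.seed(1)
head(cbb_indices(100, block_len = 5), 20)
```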

Having said this, the implementation of Theorem \@ref(thm:nbb) follows a
similar mold of generating data, estimating, and returning a value. The most
notable deviation from Theorem \@ref(thm:nbb) is the use of an indexing trick to
indicate the starting point of each block, which allows only one copy of the data
to be held in memory instead of $m$ blocks of length $\ell$ in addition
to the initial copy.

```r
ar1_blockboot = function(Xt, block_len = 10, B = 500) {

  n = length(Xt)            # Length of Time Series
  res = rep(NA, B)          # Bootstrapped Statistics
  m = floor(n/block_len)    # Amount of Blocks

  for (i in seq_len(B)) {   # Begin block bootstrap replications

    set.seed(i + 1199)      # Set seed for reproducibility
    x_star = rep(NA, n)     # Setup storage for new TS

    for (j in seq_len(m)) { # Simulate new time series 

      index = sample(m, 1)  # Randomize block starting points

      # Place block into time series
      x_star[(block_len*(j - 1) + 1):(block_len*j)] = 
            Xt[(block_len*(index - 1) + 1):(block_len*index)]
    }

    # Calculate parameter with protection
    res[i] = tryCatch(arima(x_star, order = c(1,0,0), include.mean = FALSE)$coef,
                      error = function(e) NA)
  }

  na.omit(res)              # Release bootstrap result
}
```

With the block bootstrapping technique in hand, let us consider a scenario where the model's assumption of Gaussian residuals is violated. Consider two AR(1) processes with the same coefficient but different noise-generating mechanisms:

$$\begin{aligned}
\mathcal{M}_1:&{}& X_t &= 0.5 X_{t-1} + W_t, &{}& W_t \sim \mathcal{N}(0,1) \\
\mathcal{M}_2:&{}& X_t &= 0.5 X_{t-1} + V_t, &{}& V_t \sim t_4
\end{aligned}$$

The generation procedure for $\mathcal{M}_1$ is straightforward: use `gen_gts()`. However, the underlying generating mechanism for `gen_gts()` relies upon the noise being Gaussian. Therefore, to generate $\mathcal{M}_2$, one must use R's `arima.sim()` function with a custom noise generator defined to sample from a $t$ distribution.

```{r mbbdatavis, fig.height = 6, fig.width = 9, fig.cap = "Simulated realizations of $\\mathcal{M}_1$ (Gaussian noise) and $\\mathcal{M}_2$ ($t_4$ noise)."}
set.seed(1)                                        # Set seed for reproducibility
xt_m1 = gen_gts(500, AR1(phi = 0.5, sigma2 = 1))   # Gaussian noise only
xt_m2 = arima.sim(n = 500, list(ar = 0.5, ma = 0), # Multiple noises supported
                  rand.gen = function(n, ...) rt(n, df = 4))
library(gridExtra) 
plot1 = autoplot(xt_m1) + ylab("Model 1")
plot2 = autoplot(gts(xt_m2)) + ylab("Model 2")
grid.arrange(plot1, plot2, nrow = 2) 
```

From Figure \@ref(fig:mbbdatavis), the two time series look remarkably different as a result of the noise process being altered, even though the $\phi$ coefficient was held constant. The difference is principally due to the variance of the $t$ distribution, which is $\frac{\nu}{\nu-2}$ for $\nu > 2$ and $\infty$ for $\nu \le 2$. Therefore, the i.i.d. white noise variances are $\sigma^2_{\mathcal{M}_1} = 1$ for the first model compared to $\sigma^2_{\mathcal{M}_2} = 2$ for the alternative. With this being said, both the nonparametric (block) and parametric bootstrap estimation procedures are the same for the two models. The implementation of the blocking approach for an AR(1) was discussed previously, and the parametric approach is implemented below.
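
Before doing so, a quick simulation (an aside of our own, not part of the original analysis) confirms the theoretical variance of the $t_4$ distribution:

```r
# Empirical check: the variance of a t distribution with nu = 4 degrees of
# freedom should be close to nu/(nu - 2) = 2.
set.seed(123)
var(rt(10^6, df = 4))
```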

```r
ar1_paraboot = function(model, B = 10000) {
  est.phi = rep(NA,B)    # Define a storage vector

  for(i in seq_len(B)) { # Perform bootstrap

    set.seed(B + i)      # Set seed for reproducibility

    # Simulate time series underneath the estimated model
    x.star = arima.sim(n = 500, list(ar = model$coef, ma = 0),
                       sd = sqrt(model$sigma2))

    # Attempt to estimate parameters with recovery
    est.phi[i] = tryCatch(arima(x.star, order = c(1,0,0),
                                   include.mean = FALSE)$coef, 
                          error = function(e) NA)

  }

  na.omit(est.phi)       # Return estimated phis.
}
```

With both functions in hand, the procedure reduces to calling them on the different data sets and fitted models.

```r
B = 10000 

# Model 1
fit_m1_mle = arima(xt_m1, order = c(1,0,0), include.mean = FALSE)
para_m1_phi  = ar1_paraboot(fit_m1_mle, B = B)
block_m1_phi = ar1_blockboot(xt_m1, block_len = 25, B = B)

# Model 2
fit_m2_mle = arima(xt_m2, order = c(1,0,0), include.mean = FALSE)
para_m2_phi  = ar1_paraboot(fit_m2_mle, B = B)
block_m2_phi = ar1_blockboot(xt_m2, block_len = 25, B = B) 
```

```{r blockbootmodels, cache = TRUE, echo = FALSE, fig.height = 6, fig.width = 9, fig.cap = "Estimated distributions of $\\hat{\\phi}$ for the MLE parameter estimates. The histogram bars represent the empirical results from the bootstraps, with green representing the parametric bootstrap and red representing the block bootstrap approach. The dashed vertical line represents the true value of $\\phi$ and the vertical ticks correspond to the limits of the 95% confidence intervals for both estimation techniques."}
# Rewrite as a function later...
# Has a dependency on previously written code.
par(mfrow = c(1,2))

# Model 1
hist.phi.ML = hist(para_m1_phi, plot = FALSE, breaks = 25)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]

hist.phi.ML.bboot = hist(block_m1_phi, plot = FALSE, breaks = 25)
hist.phi.ML.bboot$counts = hist.phi.ML.bboot$counts/sum(hist.phi.ML.bboot$counts)/diff(hist.phi.ML.bboot$mids)[1]

plot(NA, xlim = c(0.28, 0.6), ylim = range(yy.ML), xlab = expression(phi),
     ylab = "Density", main = "Model 1")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.95, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.ML, xlim = c(0.28, 0.6), ylim = range(yy.ML), add = TRUE, col = colors[3], border = "grey60")

plot(hist.phi.ML.bboot, xlim = c(0.28, 0.6), ylim = range(yy.ML), add = TRUE, col = colors[1], border = "grey60")

points(quantile(para_m1_phi, 0.975), 0, cex = 3, col = colors[3], pch = "|")
points(quantile(para_m1_phi, 0.025), 0, cex = 3, col = colors[3], pch = "|")

points(quantile(block_m1_phi, 0.975), 0, cex = 3, col = colors[1], pch = "|")
points(quantile(block_m1_phi, 0.025), 0, cex = 3, col = colors[1], pch = "|")
abline(v = 0.5, lwd = 2, lty = 2, col = "grey60")

# Model 2
hist.phi.ML = hist(para_m2_phi, plot = FALSE, breaks = 25)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]

hist.phi.ML.bboot = hist(block_m2_phi, plot = FALSE, breaks = 25)
hist.phi.ML.bboot$counts = hist.phi.ML.bboot$counts/sum(hist.phi.ML.bboot$counts)/diff(hist.phi.ML.bboot$mids)[1]

plot(NA, xlim = c(0.28, 0.6), ylim = range(yy.ML), xlab = expression(phi),
     ylab = "Density", main = "Model 2")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.95, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.ML, xlim = c(0.28, 0.6), ylim = range(yy.ML), add = TRUE, col = colors[3], border = "grey60")

plot(hist.phi.ML.bboot, xlim = c(0.28, 0.6), ylim = range(yy.ML), add = TRUE, col = colors[1], border = "grey60")

points(quantile(para_m2_phi, 0.975), 0, cex = 3, col = colors[3], pch = "|")
points(quantile(para_m2_phi, 0.025), 0, cex = 3, col = colors[3], pch = "|")

points(quantile(block_m2_phi, 0.975), 0, cex = 3, col = colors[1], pch = "|")
points(quantile(block_m2_phi, 0.025), 0, cex = 3, col = colors[1], pch = "|")

abline(v = 0.5, lwd = 2, lty = 2, col = "grey60")
```

The results from the parametric and nonparametric block bootstrap techniques
are displayed in Figure \@ref(fig:blockbootmodels). In both instances, we can
see that the block bootstrap produced a narrower distribution. However, there is a
considerable bias that leads the distribution to be off-centered. This bias
originates from the artificial independence between blocks as they are placed
into $(X^*_t)$.


### Forecasting AR(p) Models

One of the most interesting tasks in time series analysis is to predict
future unobserved values based on the values observed up to now. However, this is
not possible if the underlying model is unknown; thus, in this section we assume
that the time series $X_t$ is stationary and that its model is known. In particular, we
denote forecasts by $X^{n}_{n+m}$, where $n$ represents the number of data points collected
(e.g. $\bm{X} = \{X_{1}, X_{2}, \cdots , X_{n-1}, X_n\}$) and $m$ represents the number
of points in the future we wish to predict. So, $X^{n}_{n+1}$ represents a
one-step-ahead prediction of $X_{n+1}$ given the data $\{X_{1}, X_{2}, \cdots, X_{n-1}, X_{n}\}$.
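
As a quick illustration of this notation in R (a sketch of our own, not tied to a particular dataset), a one-step-ahead prediction $X^{n}_{n+1}$ can be obtained from a fitted model with `predict()`:

```r
# Illustrative sketch: one-step-ahead prediction from a fitted AR(1)
set.seed(10)
Xt = gen_gts(100, AR1(phi = 0.5, sigma2 = 1))
fit = arima(Xt, order = c(1, 0, 0), include.mean = FALSE)

# X^{n}_{n+1} and its standard error (m = 1 step ahead)
predict(fit, n.ahead = 1)
```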

To obtain forecasts, we rely upon the best linear prediction (BLP). Recall that
the BLP was introduced in Definition \@ref(def:blp) when we calculated the PACF of an AR model.
In general, the best approach to finding the BLP is to use Theorem \@ref(thm:projtheo).

```{theorem, label="projtheo", name="Projection Theorem"}
Let $\mathcal{M} \subset \mathcal{L}_2$ be a closed linear subspace of a Hilbert space. 
For every $y \in \mathcal{L}_2$, there exists a unique element $\hat{y} \in \mathcal{M}$ that
minimizes $||y - l||^2$ over $l \in \mathcal{M}$. This element is uniquely 
determined by the requirements

1. $\hat{y} \in \mathcal{M}$ and 
2. $(y - \hat{y}) \perp \mathcal{M}$.
```

The projection theorem naturally leads to an equivalent way of finding the best linear predictor, namely by solving the prediction equations

\[\mathbb{E}\left[X_{t+h} - \hat{X}_{t+h}\right] = 0 \;\; \text{(trivially satisfied for a zero-mean time series)}\]

and

\[\mathbb{E}\left[\left(X_{t+h} - \hat{X}_{t+h}\right)X_j\right] = 0, \;\; \text{for } j = 1, \dots, t.\]

Writing $\mathbb{E}\left[X_{i} X_{j}\right]$ as $\gamma(|i - j|)$, the prediction equations can be represented in the following form:

\begin{equation}
\begin{aligned}
\begin{pmatrix}
\gamma(0) & \gamma(1) & \cdots & \gamma(N-1) \\
\gamma(1) & \gamma(0) & \cdots & \gamma(N-2) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma(N-1) & \gamma(N-2) & \cdots & \gamma(0)
\end{pmatrix}_{N \times N}
\begin{pmatrix}
\alpha_1 \\ \vdots \\ \alpha_N
\end{pmatrix}_{N \times 1}
&=
\begin{pmatrix}
\gamma(N+h-1) \\ \vdots \\ \gamma(h)
\end{pmatrix}_{N \times 1} \\
\Gamma_N \bm{\alpha}_N &= \bm{\gamma}_N
\end{aligned}
\end{equation}

Assuming that $\Gamma_N$ is non-singular, then the values of $\bm{\alpha}_N$ are given as:

$$\bm{\alpha}_N = \Gamma^{-1}_N\bm{\gamma}_N$$
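
To see how the prediction equations can be solved numerically, the following sketch (our own illustration, not part of the original text) builds $\Gamma_N$ and $\bm{\gamma}_N$ from the theoretical autocovariance of an AR(1) with $\phi = 0.5$ and $\sigma^2 = 1$, and recovers the familiar one-step-ahead BLP $\hat{X}_{N+1} = \phi X_N$:

```r
# Sketch: BLP weights for one-step-ahead prediction of an AR(1)
# with phi = 0.5 and sigma2 = 1, obtained from the prediction equations.
phi = 0.5; sigma2 = 1; N = 5; h = 1

# Theoretical autocovariance of an AR(1): gamma(k) = sigma2 * phi^|k| / (1 - phi^2)
gamma_ar1 = function(k) sigma2 * phi^abs(k) / (1 - phi^2)

Gamma_N = outer(1:N, 1:N, function(i, j) gamma_ar1(i - j))  # N x N Toeplitz matrix
gamma_N = gamma_ar1((N + h - 1):h)                          # Right-hand side vector

alpha_N = solve(Gamma_N, gamma_N)
round(alpha_N, 10)   # Only X_N receives weight: alpha = (0, ..., 0, phi)
```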

The best predictor of $X_{t+j}$ given the observations, in the mean squared error sense, is the conditional expectation

\[X_{t + j}^t = E\left[ {{X_{t + j}}|{X_t}, \cdots ,{X_1}} \right] = {E_t}\left[ {{X_{t + j}}} \right], \;\; j > 0.\]

Indeed, \[E\left[ {{{\left( {{X_{t + j}} - m\left( {{X_1}, \cdots ,{X_t}} \right)} \right)}^2}} \right] \ge E\left[ {{{\left( {{X_{t + j}} - X_{t + j}^t} \right)}^2}} \right]\] for any function $m(\cdot)$, as shown by the argument below.
Consider:

$\widetilde{\beta}^{*} = \mathop {\arg \min }\limits_\beta  \underbrace {E\left[ {{{\left( {{X_{t + h}} - m\left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)} \right)}^2}} \right]}_{MSE\left( \beta  \right)}$

and

$\widetilde{\beta} = \mathop {\arg \min }\limits_\beta  \underbrace {E\left[ {{{\left( {{X_{t + h}} - \sum\limits_{j = 1}^{h - 1} {{\beta _j}{X_{t + j}}} } \right)}^2}} \right]}_{MS{E_L}\left( \beta  \right)}$


It is clear that:

\[MS{E_L}\left( {\tilde \beta } \right) \ge MSE\left( {{{\tilde \beta }^*}} \right)\]

Let's now consider 

\[{\left( {{X_t} - m} \right)^2} = {\left[ {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right) + \left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right]^2}\]

where we have: $m = m\left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)$ and ${\Omega _t} = \left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)$.

Therefore, 

\[{\left( {{X_t} - m} \right)^2} = {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)^2} + {\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)^2} + 2\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)\]


Focusing on only the last term, we have that:

\[\underbrace {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)}_{ = {\varepsilon _t}}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)\]

\[E\left[ {{\varepsilon _t}|{\Omega _t}} \right] = E\left[ {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]|{\Omega _t}} \right] = 0\]

So, by the decomposition property, we have that:

\[E\left[ {{\varepsilon _t}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right] = E\left[ {E\left[ {{\varepsilon _t}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)|{\Omega _t}} \right]} \right] = E\left[ {\underbrace {E\left[ {{\varepsilon _t}|{\Omega _t}} \right]}_{ = 0}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right] = 0\]

By the previous discussion, we have that

\[{\left( {{X_t} - m} \right)^2} = {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)^2} + {\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)^2}\]

is therefore minimized for $m = E\left[ {{X_t}|{\Omega _t}} \right]$.

```{example, label="ar1forecast", name="Forecasting with an AR(1)"}
We begin the forecasting examples by considering an AR(1) process:

$${X_t} = \phi {X_{t - 1}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

From here, the conditional expected mean and variance are given as:

\[\begin{aligned} 
  {E_t}\left[ {{X_{t + j}}} \right] &= {\phi ^j}{X_t} \\
  {\operatorname{var} _t}\left[ {{X_{t + j}}} \right] &= \left( {1 + {\phi ^2} + {\phi ^4} + \cdots + {\phi ^{2\left( {j - 1} \right)}}} \right){\sigma ^2} 
\end{aligned}\]

Within this derivation, it is important to remember that:

\[\begin{aligned} 
  \mathop {\lim }\limits_{j \to \infty } {E_t}\left[ {{X_{t + j}}} \right] &= 0 = E\left[ {{X_t}} \right] \\
  \mathop {\lim }\limits_{j \to \infty } {\operatorname{var} _t}\left[ {{X_{t + j}}} \right] &= \frac{{{\sigma ^2}}}{{1 - {\phi ^2}}} = \operatorname{var} \left( {{X_t}} \right)
\end{aligned}\]
```
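
The sketch below (our own illustration with arbitrary values $\phi = 0.7$, $\sigma^2 = 1$ and $X_t = 2$) evaluates these formulas numerically, showing how the forecast mean decays towards the unconditional mean while the forecast variance grows towards the unconditional variance:

```r
# Sketch: j-step-ahead conditional mean and variance for an AR(1)
# with phi = 0.7, sigma2 = 1, conditional on X_t = 2 (illustrative values).
phi = 0.7; sigma2 = 1; x_t = 2
j = 1:10

forecast_mean = phi^j * x_t                           # E_t[X_{t+j}] = phi^j X_t
forecast_var  = sigma2 * (1 - phi^(2*j))/(1 - phi^2)  # sigma2 * (1 + phi^2 + ... + phi^(2(j-1)))

rbind(mean = forecast_mean, variance = forecast_var)

# Limits: mean -> 0 = E[X_t], variance -> sigma2/(1 - phi^2) = var(X_t)
sigma2 / (1 - phi^2)
```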

```{example, label="ar2forecast", name="Forecasting with an AR(2)"}
Consider an AR(2) process with the following parametrization:

$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

Then, we are able to find the BLP for each step by:

\[\begin{aligned}
  {E_t}\left[ {{X_{t + 1}}} \right] &= {\phi _1}{X_t} + {\phi _2}{X_{t - 1}} \\
  {E_t}\left[ {{X_{t + 2}}} \right] &= {E_t}\left[ {{\phi _1}{X_{t + 1}} + {\phi _2}{X_t}} \right] = {\phi _1}{E_t}\left[ {{X_{t + 1}}} \right] + {\phi _2}{X_t} \\
   &= {\phi _1}\left( {{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}} \right) + {\phi _2}{X_t} \\
   &= \left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}} \\ 
\end{aligned} \]
```

```{example, label="arpforecast", name="Forecasting with an AR(p)"}
Consider an AR(p) process given as:

$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} + \cdots + {\phi _p}{X_{t - p}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

The process can be rearranged into matrix form denoted by:

\[\begin{aligned} 
\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  \vdots \\ 
  {{X_{t - p + 1}}} 
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}& \cdots &{}&{{\phi _p}} \\ 
  {}&{}&{}&0 \\ 
  {}&{{I_{p - 1}}}&{}& \vdots \\ 
  {}&{}&{}&0 
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_{t - 1}}} \\ 
  \vdots \\ 
  {{X_{t - p}}} 
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
  1 \\ 
  0 \\ 
  \vdots \\ 
  0 
\end{array}} \right]}_C{W_t} \\
  {Y_t} &= A{Y_{t - 1}} + C{W_t} 
\end{aligned}\]

From here, the conditional mean and variance are obtainable via:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + j}}} \right] &= {E_t}\left[ {A{Y_{t + j - 1}} + C{W_{t + j}}} \right] = {E_t}\left[ {A{Y_{t + j - 1}}} \right] + \underbrace {{E_t}\left[ {C{W_{t + j}}} \right]}_{ = 0} \\
   &= {E_t}\left[ {A\left( {A{Y_{t + j - 2}} + C{W_{t + j - 1}}} \right)} \right] = {E_t}\left[ {{A^2}{Y_{t + j - 2}}} \right] = \cdots = {A^j}{Y_t} \\
  {\operatorname{var} _t}\left( {{Y_{t + j}}} \right) &= {\operatorname{var} _t}\left( {A{Y_{t + j - 1}} + C{W_{t + j}}} \right) \\
   &= {\sigma ^2}C{C^T} + {\operatorname{var} _t}\left( {A{Y_{t + j - 1}}} \right) = {\sigma ^2}C{C^T} + A{\operatorname{var} _t}\left( {{Y_{t + j - 1}}} \right){A^T} \\
   &= {\sigma ^2}C{C^T} + {\sigma ^2}AC{C^T}{A^T} + {A^2}{\operatorname{var} _t}\left( {{Y_{t + j - 2}}} \right){\left( {{A^2}} \right)^T} \\
   &= {\sigma ^2}\sum\limits_{k = 0}^{j - 1} {{A^k}C{C^T}{{\left( {{A^k}} \right)}^T}} 
\end{aligned}\]

By the recursive properties evident within both the conditional mean and variance, the evaluation can take advantage of these features through the following recursive formulation:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + j}}} \right] &= A{E_t}\left[ {{Y_{t + j - 1}}} \right] \\
  {\operatorname{var} _t}\left( {{Y_{t + j}}} \right) &= {\sigma ^2}C{C^T} + A{\operatorname{var} _t}\left( {{Y_{t + j - 1}}} \right){A^T}
\end{aligned}\]
```
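
The recursion is straightforward to implement; the sketch below (our own illustration with $\phi_1 = 0.75$, $\phi_2 = 0.2$, $\sigma^2 = 1$ and arbitrary starting values) computes the conditional mean and variance of the first component of $Y_{t+j}$ for $j = 1, \ldots, 5$:

```r
# Sketch: recursive conditional mean and variance in state-space form
# for an AR(2) with phi = c(0.75, 0.2) and sigma2 = 1 (illustrative values).
phi = c(0.75, 0.2); sigma2 = 1
A = rbind(phi, c(1, 0))               # Companion matrix
C = matrix(c(1, 0), ncol = 1)

Y_t = matrix(c(1.5, 0.8), ncol = 1)   # Last two observations (X_t, X_{t-1}), arbitrary
V_t = matrix(0, 2, 2)                 # var_t(Y_t) = 0 since Y_t is observed

for (j in 1:5) {
  Y_t = A %*% Y_t                                  # E_t[Y_{t+j}] = A E_t[Y_{t+j-1}]
  V_t = sigma2 * C %*% t(C) + A %*% V_t %*% t(A)   # var_t recursion
  cat("j =", j, " mean =", round(Y_t[1], 4), " var =", round(V_t[1, 1], 4), "\n")
}
```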

```{example, label="rear2forecast", name="Forecasting with an AR(2) in Matrix Form"}
With the newfound knowledge regarding the recursive form of matrix predictions in hand,
we return to the previous AR(2) forecasting example.

\[\begin{aligned} \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_{t - 1}}} \\ 
  {{X_{t - 2}}} 
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
  1 \\ 
  0 
\end{array}} \right]}_C{W_t} \\
  {Y_t} &= A{Y_{t - 1}} + C{W_t} 
\end{aligned}\]

Then, we are able to calculate the BLP as:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + 2}}} \right] &= {A^2}{Y_t} = \left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right] \hfill \\
   &= \left[ {\begin{array}{*{20}{c}}
  {\phi _1^2 + {\phi _2}}&{{\phi _1}{\phi _2}} \\ 
  {{\phi _1}}&{{\phi _2}} 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}
  {\left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}}} \\ 
  {{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}} 
\end{array}} \right] \hfill \\ 
\end{aligned} \]
```

Under the recursive formulation, we can see through a simulation study how the predictions for an AR(2) process with $\phi _1 = 0.75$ and $\phi _2 = 0.2$ vary across time in Figure \@ref(fig:predplot).

```{r predplot, cache = TRUE, fig.height = 5, fig.width = 9, fig.cap = "Simulated AR(2) process with $\\phi_1 = 0.75$ and $\\phi_2 = 0.2$ together with simulation-based forecasts and their empirical prediction quantiles over 50 steps ahead."}
library(gmwm)

gg_color_hue <- function(n, alpha = 1) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}

color = gg_color_hue(5, alpha = 1)
color2 = gg_color_hue(3, alpha = 0.01)

set.seed(2)
Xt = gen_gts(n = 200, AR(phi = c(0.75, 0.2), sigma2 = 1))

plot(NA, xlim = c(0,250), ylim  = c(-10,15), ylab="Prediction", xlab="Time")
grid()
lines(1:200,Xt, col = color[4])


B = 5000
mat = matrix(0,B,52)
mat[,1] = rep(Xt[199], B)
mat[,2] = rep(Xt[200], B)

for (i in 1:B){
  for (j in 3:52){
    mat[i,j] = 0.75*mat[i,(j-1)] + 0.2*mat[i,(j-2)] + rnorm(1)
  }
  lines(200:250, mat[i,2:52], col = color2[3])
}

quant = matrix(NA, 51,5)
quant[1,] = rep(Xt[200],1)

for (i in 2:51){
  quant[i,] = quantile(mat[,i+1], c(0.95,0.75,0.5,0.25,0.05))
}

lines(200:250, quant[,1], col = color[1], lty = 1)
lines(200:250, quant[,5], col = color[1], lty = 1)

lines(200:250, quant[,2], col = color[3], lty = 1)
lines(200:250, quant[,4], col = color[3], lty = 1)

lines(200:250, quant[,3], col = color[5], lty = 1)
```

#### Measuring Forecast Accuracy

Within time series analysis, it is often difficult to apply the traditional methodology of "training" and "test" data sets. These concepts are somewhat foreign to time series because of the temporal dependence of the data. For instance, if we randomly sampled points within the time series, there would be missing values that would need to be imputed. Therefore, to obtain an out-of-sample test, the only viable option is to use a window of past data, say $t = 1, \cdots , N - 1$, to predict $t = N$. That is to say, the training sample runs from the beginning up until "today" and the hold-out data is then "today" or "tomorrow".

Thus, time series models are validated using the notion of a "rolling forecasting origin". Depending on the context, this may also be referred to as "walk forward optimization", "rolling horizon" or simply "moving origin". The traditional rule of thumb of using about 2/3 of the data for training and about 1/3 for testing still holds.

There are different forecast measures that can be used to assess the quality of the predictions and, subsequently, of the model. Two prevalent measures will be emphasized within this text: the Median Absolute Prediction Error (MAPE) and the Mean Squared Prediction Error (MSPE). Both are defined as follows:

\begin{align}
MAPE &= \mathop {\mathrm{median}}\limits_{t = m, \ldots ,n - 1} \left| {{{\hat E}_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}} \right| = \mathop {\mathrm{median}}\limits_{t = m, \ldots ,n - 1} \left| {{r_t}} \right| \\
MSPE &= \frac{1}{{n - m}}\sum\limits_{t = m}^{n - 1} {{{\left( {{{\hat E}_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}} \right)}^2}} = \frac{1}{{n - m}}\sum\limits_{t = m}^{n - 1} {r_t^2}
\end{align}

where ${r_t} = {\hat E_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}$ denotes the one-step-ahead prediction error.

```{proposition, name="Rolling Forecasting Origin"}
The rolling forecasting origin algorithm can be described as follows:

  1. Divide the data into a "training" set ranging over $t = 1, \ldots, m$, where $m = \left\lfloor \frac{2N}{3} \right\rfloor$, and a "test" set over $t = m+1, \ldots, N-1$.
  2. Compute $\hat{\theta}$ on $X_t$ where $t = 1, \ldots, m$.
  3. Using $\hat{\theta}$, compute ${\hat E_m}\left[ {{X_{m + 1}}} \right] = {\hat \phi _1}{X_m} + \cdots + {\hat \phi _p}{X_{m + 1 - p}}$.
  4. Compute the prediction error ${r_i} = {\hat E_m}\left[ {{X_{m + 1}}} \right] - {X_{m + 1}}$.
  5. Set $i = i + 1$ and $m = m + 1$, then return to Step 2 until $m = N - 1$; then go to Step 6.
  6. Compute the desired forecast measure.
```
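
A minimal sketch of this procedure for an AR(1) is given below; the function `rolling_origin_ar1` and its defaults are ours, not part of a package. It returns the MAPE and MSPE computed from the one-step-ahead prediction errors.

```r
# Sketch of a rolling forecasting origin for an AR(1); illustrative only.
rolling_origin_ar1 = function(Xt) {

  N = length(Xt)
  m = floor(2 * N / 3)           # Initial training size
  r = rep(NA, N - m)             # One-step-ahead prediction errors

  for (k in seq_along(r)) {
    train = Xt[1:(m + k - 1)]    # Training window up to "today"
    fit = arima(train, order = c(1, 0, 0), include.mean = FALSE)
    pred = predict(fit, n.ahead = 1)$pred[1]
    r[k] = pred - Xt[m + k]      # Prediction error r_i
  }

  c(MAPE = median(abs(r)), MSPE = mean(r^2))
}

# Example usage on a simulated AR(1)
set.seed(55)
Xt = gen_gts(300, AR1(phi = 0.5, sigma2 = 1))
rolling_origin_ar1(as.numeric(Xt))
```
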
Given all of the above discussion, there is one alternative that would enable the
data to be separated completely. It is based upon the ability to take
distinct temporal sets, such as year 1, year 2, and so on, and to fit an individual
model to each time period. The stability of the parameter estimates can then be
compared across years.


### Model Diagnostics

Once a model has been obtained, its validity must be assessed.
Indeed, one of the cornerstones of statistics is assessing how well
a model fits the data. The diagnostic process associated with verifying that
the model provides an appropriate approximation of the true process relies upon
an analysis of the residuals from the fit.

Now, the diagnostic analysis is somewhat related to the concept of model
selection addressed in Section \@ref(model-selection). However,
they are distinctly different topics. For instance, the best model selected may
still lead to residuals that do not conform to the assumptions made by the
time series model.

With this in mind, we begin by stating the assumptions that each time series
model makes about its residuals:

1. Between all time lags the residuals have no dependency.
2. There is a constant variance among residuals.
3. The mean of the residuals is zero.
4. The residuals are normally distributed.

Typically, within time series diagnostics, violations occur around
assumptions 1 and 2. Generally, if assumption 1 is violated, this indicates that a
better model can be found. If assumption 2 or 3 is violated, this indicates that a trend
is still present within the time series. The last assumption is not strictly required,
but it is a convenient property for carrying out inference.

For addressing time series diagnostics, we have opted to create a package
called `exts` that houses the majority of time series specific diagnostic and
exploratory functions. These functions greatly augment what is
currently available in base R and offer convenient syntax. More details can be
found at <http://smac-group.com/docs/exts>.

The motivating process that we will use to explore the diagnostic information
will be an AR(3) process defined as:

$${X_t} = 0.8{X_{t - 3}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

```r
library("exts")

# Set seed for reproducibility
set.seed(9123)

# Generate AR(3)
Xt = gen_gts(300, AR(phi = c(0, 0, 0.8), sigma2 = 1))

# View time series
autoplot(Xt)
```

With the data in hand, we will use our a priori knowledge to fit an AR(3) model.

```r
# Fit time series without mean.
model = arima(Xt, order = c(3,0,0), include.mean = FALSE)
model
```

Now, there are two ways to view the residuals from the model: raw or standardized. The latter is the preferred method and will be used throughout this section.

To address the first point regarding residual dependency across time lags, we turn to the ACF. In particular, we are interested in observing both the classical and robust versions of the ACF to see whether there are problematic values (e.g. outliers) among the residuals in addition to any remaining dependency.

```r
# Standardize the residuals
stdres = model$residuals/sd(model$residuals)

# Graph both R/ACF
autoplot(compare_acf(stdres))
```

However, by solely observing a graph, we are neglecting the efficient portmanteau tests discussed within Section \@ref(portmanteau-test). In particular, these tests provide inference as to whether or not serial dependency is present at a given lag. Most prominently, there are two tests to assess whether the residuals behave like white noise: Ljung-Box and Box-Pierce. Their results can be obtained by:

```r
autoplot(diag_ljungbox(model, stdres = T))
autoplot(diag_boxpierce(model, stdres = T))
```

Within both graphs, note that at the first test point there is evidence to suggest that serial correlation is present. However, the majority of points indicate that the model provides a good fit. Thus, without knowing the true model parametrization, one may wish to fit a higher-order model to verify that the residuals have no dependency.

After having assessed the serial correlation of the residuals, attention should now be turned to the normality and skewness of the residuals. Both of these traits are akin to what is traditionally assessed for a linear regression model. That is, one examines a histogram or quantile-quantile (QQ) plot of the residuals.

To assist in this assessment, the traditional histogram has been modified to have two different density curves. The blue curve matches the theoretical normal density while the grey curve is a kernel approximation of the density associated with the empirical residuals. Generally, both of these lines should be quite similar. In addition, the empirical results must possess the appearance of a normal bell-shaped distribution.

For the QQ plot, the main principle is that the residuals should adhere to the QQ line formed from the first to the third quantile of the distribution. If the residuals in one or both tails fall off of the QQ line, then there is likely an issue with the underlying distribution. The issue can range from light or heavy tails, to right or left skewness, or even bimodality.

```r
autoplot(diag_resid(model, std = TRUE))
autoplot(diag_qq(model, std = TRUE))
```

Now, each of these graphs taken independently is only one part of the larger picture. Often, it is better to view each graph alongside the other diagnostic information. Therefore, the ideal function to use is `diag_ts()`. This function provides all of the above graphs embedded within the same display to enable a quick scan of the model information.

```r
autoplot(diag_ts(model, Xt, std = TRUE))
```

## Moving Average Models

A moving average model can be interpreted in a similar way to an AR(p) model, except that in this case the time series is the result of a linear operation on the innovation process rather than on the time series itself.

```{definition, label="maq", name="Moving Average of Order q"}
A Moving Average of order $q$, or MA(q), model is defined as follows:

$${X_t} = \theta_1 W_{t-1} + ... + \theta_q W_{t-q} + W_t,$$

where $W_t \sim WN(0, \sigma^2_W)$.
```

```{definition, label="maqo", name="Moving Average Operator"}
Following the same logic as for the AR(p) models and using the backshift operator
as before, we can re-express these moving average processes as follows
$$X_t = \theta(B)W_t$$
where $\theta(B) = 1 + \theta_1B + ... + \theta_qB^q$.
```

These processes are always stationary, no matter which values the parameters $\theta_1, \ldots, \theta_q$ take. However, MA(q) processes may not be identifiable through their autocovariance functions. By the latter we mean that different parameters for an MA(q) model of the same order can deliver exactly the same autocovariance function, making it impossible to retrieve the parameters of the model by only looking at the autocovariance function.

```{example, name="Writing a Moving Average in Operator Form"}
Consider the following MA(q) process:

$${X_t} = \theta_1 W_{t-1} + ... + \theta_q W_{t-q} + W_t$$

With a little bit of work this can be condensed to:

$${X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{W_{t - i}}}$$

The operator form then follows from:

\[{X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{B^i}{W_t}} \Leftrightarrow {X_t} = \left( {1 + \sum\limits_{i = 1}^q {{\theta _i}{B^i}} } \right){W_t} \Leftrightarrow {X_t} = \theta \left( B \right){W_t}\]
```

```{example, name="Stationarity of an MA(q) Process"}
The stationarity of an MA(q) process can be shown assuming that $\sum\limits_{j = 1}^q {\theta _j^2}  < \infty$.

First, the expectation is easily derived to be $E\left[ {{X_t}} \right] = 0$.

Prior to proceeding to the computation of the autocovariance, note that if we set $\theta_0 = 1$, then 

$${X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{W_{t - i}}}$$

may be written succinctly as $X_t = \sum\limits_{i = 0}^q {{\theta _i}{W_{t - i}}}$.

Therefore, the autocovariance is obtainable by:

\begin{align}
\cov \left( {{X_{t + h}},{X_t}} \right) &= \cov \left( {\sum\limits_{j = 0}^q {{\theta _j}{W_{t + h - j}}} ,\sum\limits_{i = 0}^q {{\theta _i}{W_{t - i}}} } \right) \\
&= \sum\limits_{j = 0}^q {\sum\limits_{i = 0}^q {{\theta _j}{\theta _i}\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)} }  \\
&= \underbrace {\sum\limits_{\substack{j,i = 0 \\ j \ne i + h}}^q {{\theta _j}{\theta _i}\underbrace {\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)}_{ = 0}} }_{ = 0} + {1_{\left\{ {\left| h \right| \leqslant q} \right\}}}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}\cov \left( {{W_t},{W_t}} \right)} \\
&= {1_{\left\{ {\left| h \right| \leqslant q} \right\}}}{\sigma ^2}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}}
\end{align}

As a result, we have:

\[{1_{\left\{ {\left| h \right| \leqslant q} \right\}}}{\sigma ^2}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}}  \leqslant {\sigma ^2}\sum\limits_{j = 0}^q {\theta _j^2}  < \infty \]

Hence, the process is stationary.
```

```{example, name="Non-uniqueness of MA models"}
One particular issue with MA models is the fact that they are not unique. In essence, one is not able to tell whether an observed process comes from one model or another. Consider the following two models:

$$\begin{aligned}
\mathcal{M}_1:&{}& X_t &= W_{t-1} + \frac{1}{\theta}W_t, &{}& W_t \sim \mathcal{N}(0, \sigma^2\theta^2) \\
\mathcal{M}_2:&{}& Y_t &= V_{t-1} + \theta V_t, &{}& V_t \sim \mathcal{N}(0,\sigma^2)
\end{aligned}$$

By observation, one can note that the models share the same expectation:

\[E\left[ {{X_t}} \right] = E\left[ {{Y_t}} \right] = 0\]

However, the autocovariance requires a bit more effort:

\begin{align}
\cov \left( {{X_t},{X_{t + h}}} \right) &= \cov \left( {{W_{t-1}} + \frac{1}{\theta }{W_t},{W_{t + h - 1}} + \frac{1}{\theta }{W_{t + h}}} \right) = {1_{\left\{ {h = 0} \right\}}}\left({\sigma ^2}{\theta ^2} + {\sigma ^2}\right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}{\sigma ^2}\theta = {\sigma ^2}\left( {{1_{\left\{ {h = 0} \right\}}}\left(1 + {\theta ^2}\right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}\theta } \right) \\
\cov \left( {{Y_t},{Y_{t + h}}} \right) &= \cov \left( {{V_{t-1}} + \theta {V_t},{V_{t + h - 1}} + \theta {V_{t + h}}} \right) = {1_{\left\{ {h = 0} \right\}}}\left({\sigma ^2} + {\sigma ^2}{\theta ^2}\right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}{\sigma ^2}\theta = {\sigma ^2}\left( {{1_{\left\{ {h = 0} \right\}}}\left(1 + {\theta ^2}\right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}\theta } \right)
\end{align}

Therefore, $\cov \left( {{X_t},{X_{t + h}}} \right) = \cov \left( {{Y_t},{Y_{t + h}}} \right)$! Moreover, since $W_t$ and $V_t$ are Gaussian, the two processes have exactly the same distribution and, thus, cannot be distinguished.
```
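
This non-identifiability can also be verified numerically: base R's `ARMAacf()` returns the same autocorrelation function for an MA(1) with parameter $\theta$ and one with parameter $1/\theta$ (a small check of our own):

```r
# The MA(1) models with theta = 0.5 and theta = 1/0.5 = 2 share the same ACF.
ARMAacf(ma = 0.5, lag.max = 3)
ARMAacf(ma = 2,   lag.max = 3)
```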

The implication of the last example is rather profound. In particular, consider

\[\vec X = \left[ {\begin{array}{*{20}{c}} {{X_1}} \\ \vdots \\ {{X_N}} \end{array}} \right], \quad \vec Y = \left[ {\begin{array}{*{20}{c}} {{Y_1}} \\ \vdots \\ {{Y_N}} \end{array}} \right]\]

Thus, the covariance matrix is given by:

\[\cov \left( {\vec X} \right) = {\sigma ^2}\left[ {\begin{array}{*{20}{c}} 
{\left( {1 + {\theta ^2}} \right)}&\theta &0& \cdots &0 \\ 
\theta &{\left( {1 + {\theta ^2}} \right)}&\theta &{}& \vdots \\ 
0&\theta &{\left( {1 + {\theta ^2}} \right)}& \ddots &{} \\ 
\vdots &{}& \ddots & \ddots &\theta \\ 
0& \cdots &{}&\theta &{\left( {1 + {\theta ^2}} \right)} 
\end{array}} \right] = \cov \left( {\vec Y} \right) = \Omega\]

Now, let $\vec \beta$ denote the vector of parameters to be estimated and suppose the estimation approach is the MLE:

\[L\left( {\vec \beta |\vec X} \right) = {\left( {2\pi } \right)^{ - \frac{N}{2}}}{\left| \Omega \right|^{ - \frac{1}{2}}}\exp \left( { - \frac{1}{2}{{\vec X}^T}{\Omega ^{ - 1}}\vec X} \right)\]

If the two parameter vectors ${{\vec \beta }_1} = \left[ {\begin{array}{*{20}{c}} \theta \\ {{\sigma ^2}} \end{array}} \right]$ and ${{\vec \beta }_2} = \left[ {\begin{array}{*{20}{c}} {\frac{1}{\theta }} \\ {{\sigma ^2}{\theta ^2}} \end{array}} \right]$ are considered, then

\[L\left( {{{\vec \beta }_1}|\vec X} \right) = L\left( {{{\vec \beta }_2}|\vec X} \right)\]

Hence, there is a serious identifiability problem: based on the likelihood alone it is impossible to tell which set of parameter values generated the data. To ensure that this problem does not arise in practice, we impose the requirement of invertibility, i.e. the ability to transform an MA(q) into an AR($\infty$).
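
In practice, invertibility of an MA(q) can be checked by verifying that all roots of $\theta(z) = 1 + \theta_1 z + \ldots + \theta_q z^q$ lie outside the unit circle; the snippet below (a sketch of our own) does this with base R's `polyroot()`.

```r
# Sketch: check invertibility of an MA(q) by inspecting the roots of theta(z).
# The model is invertible if all roots lie outside the unit circle.
is_invertible_ma = function(theta) {
  roots = polyroot(c(1, theta))   # Coefficients of 1 + theta_1 z + ... + theta_q z^q
  all(Mod(roots) > 1)
}

is_invertible_ma(0.5)   # TRUE:  |root| = 2   > 1
is_invertible_ma(2)     # FALSE: |root| = 0.5 < 1
```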

Linear Regression

Placeholder

Review on Linear Regression

Linear Regression with Autocorrelated Errors

Model Selection

Placeholder

Derivations of Information Criterion

Properties of Model Selection

Model Selection Simulation Studies

SARIMA Models

Placeholder

(APPENDIX) Appendix {-}

Proofs {#appendixa}

Placeholder

Proof of Theorem 1

Robust Regression Methods {#appendixb}

Placeholder

The Classical Least-Squares Estimator

Robust Estimators for Linear Regression Models

Applications of Robust Estimation


